


















The Journal of Educational Psychology


Devoted Primarily to the Scientific Study of Problems of 
Learning and Teaching. 





BOARD OF EDITORS: 


HAROLD ORDWAY RUGG, Chairman, Lincoln School of Teachers College.
RUDOLF PINTNER, Teachers College, Columbia University.
[name illegible], Teachers College, Columbia University.
BEARDSLEY RUML, Carnegie Corporation, New York City.
JAMES CARLETON BELL, Brooklyn Training School for Teachers.
LEWIS MADISON TERMAN, Leland Stanford University.
FRANK NUGENT FREEMAN, University of Chicago.
EDWARD LEE THORNDIKE, Teachers College, Columbia University.
ARTHUR IRVING GATES, Teachers College, Columbia University.
VIVIAN ALLEN CHARLES HENMON, University of Wisconsin.
LAURA ZIRBES, Assistant Editor, Lincoln School of Teachers College.

MARCH, 1924








CONTENTS

THE RELATION OF QUALITY AND SPEED OF PERFORMANCE: A FORMULA FOR COMBINING THE TWO IN THE CASE OF HANDWRITING. Arthur I. Gates
RELIABILITY OF SCHOOL TESTS OF AUDITORY ACUITY. Harvey A. Peterson and Jerome G. Kuderna
AN EMPIRICAL STUDY OF THE VARIOUS METHODS OF COMBINING INCOMPLETE ORDER OF MERIT RATINGS. Henry E. Garrett
HERRING REVISION OF THE BINET-SIMON TESTS. John P. Herring
ONE INDUSTRY’S ATTITUDE TOWARD SELECTION BY MENTAL LEVELS. C. Marcus Wienand
[entry illegible]
NOTES ON ARTICLES IN EDUCATIONAL PSYCHOLOGY IN CURRENT ISSUES OF OTHER MAGAZINES
NEW PUBLICATIONS IN EDUCATIONAL PSYCHOLOGY AND RELATED FIELDS OF EDUCATION





Published Monthly Except June to August by 
WARWICK and YORK, Inc., 
York, Pa.    Baltimore, Md.

Entered as Second Class matter November 15, 1921, at the Post Office at York, Pennsylvania, under the Act of March 3, 1879.






















THE JOURNAL OF 
EDUCATIONAL PSYCHOLOGY 


Volume XV March, 1924 Number 3 














THE RELATION OF QUALITY AND SPEED OF 
PERFORMANCE: A FORMULA FOR 
COMBINING THE TWO IN THE 
CASE OF HANDWRITING¹


ARTHUR I. GATES 
Teachers College, Columbia University 


THE GENERAL PROBLEM


In nearly all tests of capacity or achievement, the speed of perform- 
ance—the time required to do a task, or the number of tasks done in a 
given time—is a partial, often the main, determiner of the final score. 
In both mental and motor tests, it is rarely known to what degrees the 
method of computing scores emphasizes or “weights” speed as compared
to quality of performance. It appears, however, that in many
current tests, the method of scoring or of combining scores gives rate 
of performance a relatively heavy weight. So far as I am aware, no 
one has determined in a thorough way the relation of speed and quality 
of any function. Yet it is quite improbable that optimum scores, in 
which the two variables are properly combined, may be secured until 
the relations of the two variables in the case of a typical individual 
performer have been ascertained. 

This paper contains a report of studies designed to yield some 
information concerning the relation of quality and speed of performance 
in the case of one function—handwriting. This function was chosen 
because there are available “scales” to facilitate the gauging of the
quality of performance, because it is a function permitting large
changes in rate of performance and because, in case a reliable means
of combining speed and quality should be discovered, the results would
be of immediate and widespread practical value. Concerning the
last, the particular and practical problem, certain remarks may be in
order.

¹ The data upon which this study is based were gathered partly in the
Scarborough School in 1921 and partly in the Horace Mann School of Teachers College,
1923. For expert assistance in giving tests or treating results or both, I am
indebted to the following: Miss Theodosia Bay, Miss Esther Hemke, Miss Jessie
La Salle, and Miss Dorothy Van Alstyne.


THE PARTICULAR PROBLEM. A COMBINED SCORE FOR QUALITY AND
RATE OF WRITING


At present, the results of a test in writing are presented in terms of 
two variables: (1) Quality, as estimated by the use of some standard 
scale, and (2) speed, usually as the number of letters written per minute. 
For purposes of diagnosis of individual ability, the display of these 
two variables gives, under certain conditions, the most significant 
facts; we need to know whether the pupil is emphasizing speed relative 
to accuracy too little, too much, or properly. These facts may be
reliably ascertained, however, only when the test represents habitual
practice. If it does, the results of several tests given, say, on successive
days should yield approximately the same results, with speed and quality
similarly related. The uniformity of results, we have found, depends
in an appreciable measure upon the care and uniformity with which 
instructions are given. But even with the most careful and competent 
supervision, the relations of rate and quality for many individuals vary 
widely in successive tests. 

For example, in tests given on successive days in October the follow- 
ing results were obtained: 


(Speed in letters per minute; one row per pupil)

    FIRST DAY’S TEST        SECOND DAY’S TEST
    QUALITY    SPEED        QUALITY    SPEED
       9         23            8         34
       8         31            7.5       39
       7         40            7         43


On the basis of the first day’s score, which pupil has the greatest 
general ability? Comparing results of the second day with those of 
the first, which pupils have done better or worse, and by what amounts? 
Consider also the difficulty in comparing the results of tests given in 
October and the following January for certain children: 


(Speed in letters per minute; one row per pupil)

    OCTOBER                 JANUARY
    QUALITY    SPEED        QUALITY    SPEED
       8         35            8         39
      10         42           10         83
       7         39            8         38
       9         54            8         77


During the four months of school, which of these pupils have 
improved or lost and how much? None of these questions can be
answered except by rough methods of inspection. For purposes of
computing individual differences, measuring gains, and for such statis- 
tical manipulation, as obtaining correlations between writing and other 
abilities, some single measure is needed. 

In the search for a suitable device, it became apparent at once that 
neither speed nor accuracy alone would suffice. Combining Grades 
V and VI (47 pupils) the correlation between speed and quality was 
computed and found to be 0.23, a figure entirely too low to justify the 
use of either as a measure of total ability. Of those pupils (18) who 
wrote at quality 9, for example, the number of letters per minute varied 
from 25 to 78; of those who wrote from 49 to 54 letters per minute, the 
qualities varied from 12 to 6. To represent general writing ability a 
score combining speed and quality properly weighted must be found. 


THE FORMULA FOR COMBINING SPEED AND QUALITY OF WRITING.
FIRST EXPERIMENT


The search for a formula by means of which speed and quality 
could be combined was empirical, beginning with a study of the two 
variables in sharply contrasting methods of writing. Two groups, 
Grades V and VI, were asked on different occasions to write (1) “as
well as they could,” (2) “as fast as they could,” and (3) at their “ordinary
speed and quality.” These three types of writing will be designated
as “quality,” “speed,” and “normal,” respectively. The order
of tests was for Grade V, 1, 3, 2; for Grade VI, 2, 3, 1. All other
conditions, the text written, etc., were the same in all the several trials. 
Quality was judged by a competent person unacquainted with the 
experiment or children. Rate was the number of letters per minute 
(average of two two-minute tests). 

Proceeding on the assumption that general writing ability would 
be the same whether expressed in normal, high quality, or high speed, 
a formula was sought which would yield by combination of quality 
and speed, the same numerical result in all three cases. After some 
manipulation, the following formula, which will be later justified and 
applied, was found to fulfill this condition: 


$$\text{Combined score} = \text{Quality} \times \sqrt[3]{\text{Speed}} \tag{1}$$


To take a single illustrative case, for a pupil who wrote 34 letters 
per minute at quality 8 on Thorndike’s scale— 


$$\text{Combined score} = 8 \times \sqrt[3]{34} = 8 \times 3.2 = 25.6$$
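
As a hedged illustration of Formula (1), the short Python sketch below computes the combined score for the pupil just described. The function name `combined_score` is introduced here for illustration and is not the author's notation. Carrying the cube root to full precision gives about 25.9; the 25.6 in the text follows from rounding the cube root of 34 to 3.2 before multiplying.

```python
def combined_score(quality, speed):
    """Formula (1): quality times the cube root of speed (letters per minute)."""
    return quality * speed ** (1.0 / 3.0)


if __name__ == "__main__":
    # Pupil writing 34 letters per minute at quality 8 on the Thorndike scale.
    print(round(combined_score(8, 34), 1))
    # Pupil A of Table I, writing "as well as you can": quality 9, speed 28.
    print(round(combined_score(9, 28), 1))
```

The second call agrees with the 27.3 printed for pupil A in Table I.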


The original scores for Quality and Speed, and the combined scores
for the pupils of Grade V, for “quality,” “speed,” and “normal”
writing are given in Table I, together with the averages for this grade
and for Grade VI.

TABLE I

Grade V. Q = quality; Sp. = speed in letters per minute. Combined scores derived by Formula (1).

           Separate scores                                  Combined scores
           Writing as well   Writing         Writing as fast
           as you can        normal          as you can      As well   Normal   As fast
Pupil      Q       Sp.       Q       Sp.     Q       Sp.
A          9       28        8       43      8       53      27.3      28.0     30.0
B          10      24        9.5     35      9       45      28.8      31.1     31.9
C          9       29        8       39      7       54      27.6      27.1     26.4
D          10      25        9       33      8       39      29.2      28.8     27.1
E          9       30        9       37      9       45      27.9      29.9     31.9
F          10      31        8       63      8       70      31.4      32.0     32.9
G          10      43        10      63      9       71      35.0      40.0     37.3
H          10      24        9       43      8       49      28.8      31.5     29.2
I          9       40        9       60      8       76      30.7      35.2     33.8
J          10      35        9       43      9       53      32.7      32.5     33.7
K          10      62        9       53      9       71      39.5      33.7     37.2
L          9       43        8       61      7       68      31.5      31.4     28.6
M          9       28        9       37      8       49      27.3      29.9     29.2
N          11      34        11      44      9       55      35.5      38.8     34.2
O          11      25        10      57      10      55      32.1      38.4     38.0
P          13      49        13      53      12      74      47.5      48.7     50.0
Q          7       45        7       43      7       42      24.8      24.5     24.3
R          9       49        8       56      8       61      32.8      25.6     31.4
S          11      34        10      56      10      60      35.5      38.2     39.1
T          11      52        10      73      10      89      41.0      41.7     44.6
Average    9.85    36.5      9.175   49.6    8.65    58.9    32.34     33.35    33.54

Average results for Grade VI
           10.375  45.7      9.41    64.7    8.75    76.3    36.70     37.70    37.10

Grades V and VI combined
                                                             34.52     35.52    35.32









































Considering first the average results for Grade V, it is seen that 
writing as well as they could, the quality score is 9.85, speed 36.5 
letters a minute; writing “normally,” quality is 9.175 and speed 49.6;
writing at top speed, quality is 8.65 and speed 58.9. The means for
the combined scores derived by Formula (1) are 32.34, 33.35, and 33.54 
for quality, normal and fast writing respectively. For Grade VI the 
results are similar. Combining the average derived scores for both 
grades the derived scores are 34.32, 35.52 and 35.32, respectively. 
The average of the differences between these scores is 0.8 or 2.3 per 
cent of the average score. 
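
The 0.8 and 2.3 per cent figures can be checked with a few lines of Python. This is only a sketch, under the assumption that “average of the differences” means the mean absolute difference over all three pairs of scores; the helper name `mean_pairwise_difference` is hypothetical.

```python
from itertools import combinations


def mean_pairwise_difference(scores):
    """Mean absolute difference over every pair of scores."""
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Combined scores for Grades V and VI together, as quoted in the text
# (quality, normal and fast writing respectively).
scores = [34.32, 35.52, 35.32]

avg_diff = mean_pairwise_difference(scores)
avg_score = sum(scores) / len(scores)
percent_of_average = 100 * avg_diff / avg_score

print(round(avg_diff, 1), round(percent_of_average, 1))
```

Under that assumption the sketch reproduces both figures quoted in the text.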

The uniformity of the average results is really astonishing, when the 
conditions are considered. Pushing children to the extremes of quality 
and speed, functioning in non-habitual manners, tends to increase the 
unreliability of the performance. For example, in writing as well as 
one can, it would scarcely be expected that maximum efficiency or uni- 
formity would result. The combined scores for both grades were a 
trifle lower than those obtained under normal or fast writing, indicating 
that probably some or many wrote unnecessarily slowly in producing 
the maximum quality. The scoring of the specimens is also a source 
of error. If there were a constant error, either in the Thorndike scale
or in the estimations of the judge, at any one of the average levels, it 
would tend to make the average combined scores unequal. Despite 
these possibilities, they are about as nearly equal as three successive 
tests with any reliable instrument, such as a reading or arithmetic 
test, would provide. 


VALIDATION OF THE FORMULA; SECOND EXPERIMENT


A second experiment, designed to test the validity of the formula, 
was conducted in October, 1923, upon pupils of Grades III, IV, V and 
VI in the Horace Mann School. The average number of pupils per 
grade was about twenty-four. 

After a preliminary “warm up” exercise of one-half minute with
the pencil, the selection “The quick brown fox jumped over the lazy
dog” was written at normal speed for two minutes. Following this,
the pupils (in the case of Grades III and V) were asked to write faster 
than in the first tests. After a rest, they were again asked to write 
normally, and finally after another rest, to write a better quality than 
on the preceding tests. Instructions were the same for Grades IV 
and VI except that the order was (1) normal, (2) quality, (3) normal, 
(4) speed. 

All of the specimens were scored independently by three judges
who were believed to be exceptionally able.¹ They were first judged
by grades in random order, and then reviewed by comparing the four
specimens of each pupil. All judges reported making but few changes.
No judge knew anything about the instructions that were given the
pupils or about the purpose of the study.

¹ Miss Ella Woodyard, Mr. Jacob Orleans, and Mr. Paul Witty.

The conditions of this, the second experiment, differ in certain 
important respects from those of the first. In addition to new pupils, 
new judges of quality, and a larger number of tests, the writing did not 
go to the extremes of “speed” and “quality,” and two samples of
“normal” were secured. The variations in speed under different
instructions are shown in the accompanying table. 


TABLE II. NUMBER OF LETTERS PER MINUTE WRITTEN UNDER THE INSTRUCTIONS
INDICATED. HORACE MANN SCHOOL DATA. NUMERALS IN PARENTHESES
GIVE THE ORDER OF THE TEST FOR THAT GRADE

GRADE     NORMAL       SPEED        NORMAL       QUALITY
III       14.71 (1)    17.98 (2)    17.98 (3)    17.78 (4)
IV        33.39 (1)    36.26 (4)    31.26 (3)    26.73 (2)
V         44.04 (1)    51.90 (2)    45.88 (3)    38.61 (4)
VI        69.93 (1)    76.77 (4)    65.45 (3)    52.31 (2)


Pupils in Grade III, who were at this time poor writers, failed to
modify their writing appreciably under the several types of instruction.
All tests yielded essentially samples of “normal” writing; the low efficiency
in Test I being due, probably, to insufficient “warm-up.” The
other grades gave the variations desired. On request, they wrote
faster or better without going to the extremes as in the case of the 
Scarborough results. 

For each specimen, there were three independent judgments and 
an average of the three. The cube root of the number of letters written 
per minute was computed for each individual and test. As a first 
test of the validity of the formula, the average quality for each grade 
(based on the combined estimates of the three judges) was multiplied 
by the average of the cube root of speeds for the pupils of each grade. 
The resulting average “combined scores” appear in Table III. The
numerals in the last column, i.e., “average differences between scores,”
are the averages for the differences between “normal” and “quality,”
“normal” and “normal,” “normal” and “fast,” “quality” and “normal,”
etc. for the grade. The figures tell how much, on the average,
one combined score differs from another.

TABLE III. COMBINED SCORE (QUALITY TIMES CUBE ROOT OF SPEED) BASED
UPON THE AVERAGE OF THREE JUDGMENTS OF QUALITY. EACH
NUMERAL IS THE AVERAGE FOR THE GRADE INDICATED

                                                           AVERAGE
                                                           DIFFERENCE
           NORMAL      QUALITY     NORMAL      FAST        BETWEEN
GRADE      WRITING     WRITING     WRITING     WRITING     SCORES
III        19.88       21.43       21.57       21.59       0.88
IV         27.71       27.90       27.88       27.41       0.27
V          32.38       32.95       32.99       32.73       0.34
VI         39.86       38.34       39.19       39.36       0.67
Average    29.96       30.15       30.41       30.27       0.54


Inspection of Table III reveals a high degree of reliability in the 
formula. The average difference between two scores based on two 
tests is about one-half a unit on the basis of a mean score of about 30. 
These variations are really very small and are due in part, if not mainly, 
to the combined effects of the variability of performance from test to 
test and the unreliability of the appraisals of quality. The first score
for Grade III, which is the lowest for that grade, may, for example, be
mainly due to incomplete “warm-up.”
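
The last column of Table III can be recomputed from the four scores in each row. The sketch below assumes, as the description of the column suggests, that the “average difference” is the mean absolute difference over all six pairs of scores; the function name is hypothetical, and only Grades III to V are shown for illustration.

```python
from itertools import combinations

# Average combined scores from Table III, in column order:
# normal, quality, normal, fast.
table_iii = {
    "III": [19.88, 21.43, 21.57, 21.59],
    "IV": [27.71, 27.90, 27.88, 27.41],
    "V": [32.38, 32.95, 32.99, 32.73],
}


def average_difference(scores):
    """Mean absolute difference over all pairs of the four scores."""
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


for grade, scores in table_iii.items():
    print(grade, round(average_difference(scores), 2))
```

For these three grades the results agree with the printed values 0.88, 0.27 and 0.34.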

Incidentally, the combined scores display clearly the grade advance- 
ment in achievement. 

The degree to which the variations in the average combined scores 
may be due to the unreliability of judgments of the quality of the 
specimens is suggested in Table IV, in which the combined scores 
obtained from the quality estimates of each judge are given 
individually. 

Judge Wy. consistently scores the specimens a little lower than the 
other two who agreed very closely on the whole, but for all three the 
formula seemed to hold very well and about equally well. In Grade 
III, the first test gave a low combined score, but elsewhere the devia- 
tions seem to be mainly due to errors in judgment, distributed at ran- 
dom, or to variations in the performance during the several tests. A 
significant fact is that, except for Grade III, the discrepancies among 
the results of the several tests (as indicated by the numerals at the foot 
of the column) are less great than the variations on the same test 
brought about by differences in the scoring of quality by different 
judges (as indicated by the larger figures at the extreme right). Fur- 
thermore, when the scores for the three judges are combined (as shown 
in Table III) the combined scores are more uniform. The implica- 
tion is, then, that the more perfect the judgment of quality of writing, 
the closer the formula fulfills its requirement. With perfect judgments
of truly representative performances, the formula would show very
nearly, perhaps quite, identical scores.

TABLE IV. AVERAGE COMBINED SCORES OBTAINED BY USE OF THE FORMULA,
QUALITY SCORED BY A SINGLE JUDGE

                                                                       AVERAGE
                                                                       DIFFERENCE
                                                                       AMONG
ORDER OF TESTS                    JUDGE W.    JUDGE O.    JUDGE WY.    JUDGES

Grade III. 21 pupils
1.                                19.72       20.58       19.33        0.96
2.                                21.27       22.20       20.82        0.67
3.                                21.35       22.48       20.88        0.86
4.                                22.08       22.34       21.35        0.66
Average difference among tests    1.16        0.97        1.01

Grade IV. 22 pupils
1.                                27.95       28.87       26.30        1.71
2.                                28.54       27.81       25.89        1.77
3.                                28.48       28.71       26.46        1.50
4.                                28.70       28.34       26.76        1.29
Average difference                0.39        0.57        0.46

Grade V. 27 pupils
1.                                33.41       32.38       31.33        1.39
2.                                33.33       33.43       31.42        1.34
3.                                33.66       34.05       31.25        1.87
4. Quality                        34.58       33.58       30.69        2.93
Average difference                0.67        0.86        0.38

Grade VI. 25 pupils
1.                                41.01       40.03       38.53        1.65
2.                                39.74       37.64       37.63        1.41
3.                                40.31       38.74       38.52        1.19
4.                                39.44       40.09       38.55        1.03
Average difference                0.96        1.44        0.46

The formula, in sum, has adequately stood the test of six different 
classes, Grades III to VI inclusive, in two schools where three or four 
specimens, at different levels (“as fast as you can write,” “faster than
normal,” “normal,” “better quality than normal,” and “the best
quality you can write”), have been secured. The same results were
obtained from quality scores estimated by four independent judges, 
none of whom had any knowledge whatever of the purpose of the 
study or the character of the formula. 


ILLUSTRATIONS OF SOME OF THE USES OF THE FORMULA FOR COMBINING 
SPEED AND ACCURACY OF WRITING 


The use of the combined score, to assist in the interpretation of class 
averages, is indicated in the following:

Grade VI test:


October 15, 1921, mean score, quality 9.44, speed 53.0 
December 1, 1921, mean score, quality 9.41, speed 64.7 
January 15, 1922, mean score, quality 9.70, speed 63.7 


It would be quite impossible to evaluate quantitatively the gain 
represented in these scores. Applying the formula, the following 
average scores are obtained: 


Combined scores: 


October, 35.3 
December, 37.7 (gain over October 2.4) 
January, 38.8 (gain over December 1.1; over October 3.5) 


Thus the teacher may secure a useful measure of gross improvement in 
writing in general, as well as the impression of whether progress has
been mainly in speed or quality, as shown by the separate scores. 
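
A minimal sketch of this use of the formula follows. It applies Formula (1) directly to the class means, which only approximates the reported combined scores; presumably the reported figures are means of the individual pupils' combined scores, but that is an assumption. The dictionary and function names are illustrative.

```python
def combined_score(quality, speed):
    """Formula (1): quality times the cube root of speed."""
    return quality * speed ** (1.0 / 3.0)


# Grade VI class means (quality, letters per minute) from the text.
class_means = {
    "October": (9.44, 53.0),
    "December": (9.41, 64.7),
    "January": (9.70, 63.7),
}

# Combined scores as reported in the text.
reported = {"October": 35.3, "December": 37.7, "January": 38.8}

for month, (quality, speed) in class_means.items():
    # Formula applied to the class means, beside the reported value.
    print(month, round(combined_score(quality, speed), 1), reported[month])

gain_oct_to_dec = reported["December"] - reported["October"]
gain_dec_to_jan = reported["January"] - reported["December"]
gain_oct_to_jan = reported["January"] - reported["October"]
print(round(gain_oct_to_dec, 1), round(gain_dec_to_jan, 1), round(gain_oct_to_jan, 1))
```

The gains computed from the reported scores match those stated in the text (2.4, 1.1 and 3.5), and the formula applied to the class means lands within a few tenths of each reported score.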

How uniform are the results given by the formula for the individual 
pupils is shown in Table I. The average difference between the com- 
bined scores for the three tests is 0.8. Study of the individual scores 
will show occasional discrepancies of considerable magnitude. G
does relatively well in “normal” writing, a result which, like many
others, may be due to an error in scoring quality, which was in
every case save one (B) graded in terms of whole steps on the scale.
In seven cases, the quality on “as well as you can” was graded equal
to “normal,” in 10 cases “normal” was judged equal to “fast as you
can,” and in two cases “quality” was called equal to “speed” in
quality. In most of these the qualities of the samples probably differed
but were not distinguished. These errors are equalized in the average,
at least approximately; but no correction formula, of course, will correct
the errors of observation or eliminate variability in performance.
Except for one case, in Grade VI, which showed a higher quality under
“normal” than under “quality,” no extraordinary peculiarities were
revealed which would cast serious question upon the validity of the
formula, for use with individual cases. Cases like G, K, O, and R are 
to be expected. 

A matter of some importance is the maintenance of a range of
abilities in the combined scores, sufficiently wide to disclose individual
differences with sharpness. The combined scores under “normal”
range from 24.5 for Q to 48.7, or approximately twice as high, for P.
The average deviation from the mean is approximately 5.0. The use
of the formula, therefore, does not compress the range to a degree that
makes discrimination difficult; indeed, compared to the distribution of
quality scores the range is greatly enlarged. 
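
The range and average deviation can be checked from the “normal” column of combined scores in Table I. The sketch below is illustrative only; the variable names are not the author's, and the average deviation is taken to be the mean absolute deviation from the mean.

```python
# Combined scores under "normal" writing for the twenty Grade V pupils
# (A to T), copied from Table I.
normal_scores = [
    28.0, 31.1, 27.1, 28.8, 29.9, 32.0, 40.0, 31.5, 35.2, 32.5,
    33.7, 31.4, 29.9, 38.8, 38.4, 48.7, 24.5, 25.6, 38.2, 41.7,
]

mean = sum(normal_scores) / len(normal_scores)
# Mean absolute deviation from the mean (the "average deviation").
average_deviation = sum(abs(s - mean) for s in normal_scores) / len(normal_scores)
# Ratio of the highest to the lowest combined score.
ratio = max(normal_scores) / min(normal_scores)

print(round(mean, 2), round(average_deviation, 1), round(ratio, 2))
```

The mean agrees with the 33.35 printed in Table I, the average deviation comes out a little under 5, and the highest score is very nearly twice the lowest.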

The range and the uniformity of scores for the same individual in 
the three tests both influence the correlations between the three tests; 
the wider the range of scores, other things being equal, the higher the 
correlations, and the more uniform the individual’s scores, the higher 
the correlations. These correlations, usually called reliability coeffi- 
cients, are (results of Grades V and VI combined): 





High quality test with normal ..................... .911 ± P.E. .02
High quality test with high speed test ............ .890 ± P.E. .02
High speed test with normal ....................... .901 ± P.E. .02
Average ........................................... .90 ± P.E. .01
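
The probable errors attached to these coefficients are consistent with the conventional formula P.E. = 0.6745(1 − r²)/√N. The paper does not state which formula was used, so the sketch below is an assumption, as is taking N = 47 from the number of pupils reported earlier for Grades V and VI combined. The function name is hypothetical.

```python
import math


def probable_error_of_r(r, n):
    """Conventional probable error of a correlation coefficient (assumed formula)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


n_pupils = 47  # assumed: Grades V and VI combined, as reported earlier in the text

for r in (0.911, 0.890, 0.901):
    print(r, round(probable_error_of_r(r, n_pupils), 2))
```

With these assumptions each of the three coefficients carries a probable error that rounds to .02, as printed.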


These reliability coefficients are “high” but their significance is
more clearly disclosed by comparisons with other tests. For the same
groups, the Burgess reading test (two different forms) gives a coefficient
of .78; spelling (two lists of 50 words) .90, and Woody arithmetic .92.
The derived writing score then occupies a respectable position in terms 
of reliability even when two of the tests were conducted at unusual rates. 

While the combined score serves excellently such purposes as com- 
paring individuals or classes in achievement, measuring gross progress, 
or for use in experimental and statistical inquiries, it has, taken alone, 
one defect shared by nearly all “derived” scores: its meaning is not
instantly apparent; it is neither quality nor speed but a somewhat
complex combination of both. For many purposes, it is preferable to
use a score expressed in terms of a single variable, speed in terms of
quality or vice versa. To illustrate: The average score for our Grade 
V in October was approximately quality 9; and many teachers may feel 
that there is more meaning in a score which represents the rate at which 
each child would write, theoretically if not actually, if he were writing 
at this quality. Some might prefer to hold speed constant stating the 
quality at which the child would write were he to maintain that speed. 

These may be easily derived. 





Since

$$\text{Combined score} = \text{Quality} \times \sqrt[3]{\text{Speed}} \tag{1}$$

Then

$$\text{Quality} = \frac{\text{Combined score}}{\sqrt[3]{\text{Speed}}} \tag{2}$$

Also

$$\sqrt[3]{\text{Speed}} = \frac{\text{Combined score}}{\text{Quality}}$$

or

$$\text{Speed} = \left(\frac{\text{Combined score}}{\text{Quality}}\right)^{3} \tag{3}$$

To illustrate the derivation of the speed at which a given quality
may be written, we need only a test of both, e.g., Pupil X writes 34
letters per minute at quality 8. At what quality would he write at 25
and at 45 letters per minute? Using Formula (1) we get:

$$\text{Combined score} = 8 \times \sqrt[3]{34} = 25.6$$

Using Formula (2), solving for 25 letters per minute we get:

$$\text{Quality} = \frac{25.6}{\sqrt[3]{25}} = \frac{25.6}{2.92} = 8.8$$

Solving for 45 letters per minute we get:

$$\text{Quality} = \frac{25.6}{\sqrt[3]{45}} = \frac{25.6}{3.56} = 7.2$$
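
The same calculation can be sketched in Python. The function name `quality_at_speed` is hypothetical. Because the sketch carries the cube roots at full precision rather than rounding the combined score to 25.6, its results run slightly above the 8.8 and 7.2 given in the text.

```python
def quality_at_speed(quality, speed, new_speed):
    """Formula (2) applied to the combined score given by Formula (1)."""
    combined = quality * speed ** (1.0 / 3.0)
    return combined / new_speed ** (1.0 / 3.0)


# Pupil X: 34 letters per minute at quality 8.
for new_speed in (25, 34, 45):
    print(new_speed, round(quality_at_speed(8, 34, new_speed), 1))
```

At the tested speed of 34 letters per minute the function returns the observed quality of 8, which is a quick check that Formulas (1) and (2) are inverses of one another.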


Doubtless most teachers, being more familiar with variations in 
speed than with variations in quality in terms of a writing scale, 
would prefer to have the latter constant, the former varying. They 
would prefer answers to the questions—if X can write 34 letters per 
minute at quality 8 how rapidly would he write at quality 9, the class 
average, or at quality 10 which is equalled by but a few in the class? 

These facts can be readily determined: Using Formula (3) solving 
for quality 9, we have 


$$\text{Speed} = \left(\frac{25.6}{9}\right)^{3} = 2.84^{3} = 22.9$$


Solving for other qualities, results will be obtained as shown in Table V. 


TABLE V. THE RATES AT WHICH X, WHO WROTE 34 LETTERS PER MINUTE AT
QUALITY 8, WOULD WRITE AT HIGHER AND LOWER QUALITIES

               LETTERS
QUALITY        PER MINUTE
12              9.6
11             12.1
10             16.7
 9             22.9
 8             34.0 (demonstrated by test)
 7             49.0
 6             77.8
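
A table of this kind can be generated with Formula (3). The sketch below is illustrative, with a hypothetical function name. It uses unrounded cube roots, so its values run a few per cent above the printed ones, which rest on the combined score rounded to 25.6.

```python
def speed_at_quality(quality, speed, new_quality):
    """Formula (3): speed implied by the same combined score at another quality."""
    combined = quality * speed ** (1.0 / 3.0)
    return (combined / new_quality) ** 3


# Pupil X: 34 letters per minute at quality 8.
for new_quality in range(12, 5, -1):
    print(new_quality, round(speed_at_quality(8, 34, new_quality), 1))
```

At quality 8 the function returns the tested speed of 34 letters per minute, and the other rows fall off with the cube of the quality ratio, as in the printed table.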


X would write about one-half as fast at quality 10 and more than
twice as fast at quality 6 as he actually does write at quality 8. By this
formula all the members of a class may be compared on a basis of speed
with quality the same. Thus for the first six individuals whose scores
are shown in Table I (from the data given under “normal” writing)
we may estimate the speed at which they would write at quality 10. 
The results appear in Table VI. 


TABLE VI. ESTIMATED SPEED OF WRITING AT QUALITY 10 OF PUPILS WHOSE
SCORES ARE GIVEN IN TABLE I UNDER “NORMAL” WRITING

PUPIL          LETTERS PER MINUTE
A              21.9
B              30.1
C              19.9
D              24.0
E              26.7
F              32.8
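
The class comparison can be sketched by applying Formula (3) to each pupil's “normal” combined score from Table I. The variable names are illustrative, and small differences from the printed values (for example 23.9 against 24.0 for D) reflect rounding in the original computation.

```python
# "Normal" combined scores for pupils A to F, copied from Table I.
normal_combined = {"A": 28.0, "B": 31.1, "C": 27.1, "D": 28.8, "E": 29.9, "F": 32.0}

target_quality = 10

# Formula (3): speed = (combined score / quality) cubed.
estimated_speed = {
    pupil: (score / target_quality) ** 3 for pupil, score in normal_combined.items()
}

for pupil, speed in estimated_speed.items():
    print(pupil, round(speed, 1))
```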


The advantage of expressing scores in terms of a single variable 
consists in the greater intelligibility. The disadvantages are two: 
More labor is required to compute the scores, since it is necessary 
first to compute the combined score, and in extreme cases the score is 
one which the pupils at the time may be unable actually to produce. 
For example, pupil X, above, who normally writes 34 letters a minute 
at quality 8, theoretically can write 9.6 letters at quality 12 and 78 
letters at quality 6. Actually, he may be unable to write any letters 
at quality 12, and possibly he cannot write 78 letters at any quality however
low.¹ When told to write as well as they can under test conditions,
most children write about one quality unit better, and when told
to write as fast as they can, about one unit lower than normal. Further
extensions, theoretical in character, are not on these accounts invalid
or futile and need not be misleading. They express the ways in which
the pupil’s present ability might have been expressed with a different
emphasis on speed and quality, and within certain limits, that will
doubtless vary with individuals, they express the ways in which ability
may now be actually expressed. Since they indicate the ways in
which ability may be led, perhaps speedily, to express itself, tables like
that for X become an excellent instrument for guidance. Would it
be better, for example, if X wrote 23 letters at quality 9, or 49 letters
at quality 7, or in some intermediate combination, than to continue at
34 letters and quality 8? The table gives information that is much
better than guessing when educational treatment for an individual or a
class is under consideration.

¹ Neither one of these statements is certainly true. It is probable that by
providing a model, X could equal quality 12 and by a little encouragement produce
78 or more letters a minute.

It will be understood of course that such a table as that for X does
not express the result that will be obtained after further practice. It
tells only what the present ability is, how it might have been expressed,
and several ways in which it may now be expressed. It may be used 
to assist in the guidance of future practice but it tells nothing concern- 
ing the amount of improvement that such practice may produce. 
The formulas, then, are merely devices by means of which quality 
and speed of writing may be combined into a valid single score, a 
statement of general ability, from which may also be derived a state- 
ment of this ability in terms of one while the other is held constant 
at any desired level. 

The actual time required for computing scores by Formula (1) 
is slight. Using printed tables from which cube roots and products 
may be read, the combined score is obtained in an instant. Formulas 
(2) or (3) mean an addition of a few seconds of time. Any of these may 
be obtained in less time than is required to add the several subtotals 
and compute an IQ from the Stanford-Binet or the National Intelli- 
gence Test. 

The formulas presented are applicable not only to the Thorndike 
Scale, but also to any other scale which yields equivalent scores. For 
tests to yield equivalent scores, several conditions must be fulfilled. 
First, the scale must present specimens of the same kind of quality as 
that found in the Thorndike. The Ayres Scale, for example, may not 
present measures of just the same combinations of qualities or merits, 
since the criteria used in selection were not the same. Studies by 
Kelley and Pintner,² however, indicate that the two scales represent
at least similar types of writing quality. Second, for a scale to be 
equivalent to the Thorndike, the steps or units must be of the same 
magnitude and possess the same significance. This is the same as 
saying that they must be determined in an equivalent manner. Courtis³
has recently suggested, following an ingenious analysis of the
Cattell-Fullerton Theorem as applied to the construction of writing 
scales, that equal units yielded by this method are equal only in a 
certain sense. Units determined by other methods may not be equal 
in the same sense. Furthermore, the size of the units depends upon the 
ability of the individuals who serve as judges; the better the judges the 


1 See also Kelley, T. L.: Comparable Measures. Journal Educational
Psychology 1914. 
2 Pintner, Rudolph: Comparison of Ayres and Thorndike Writing Scales. 
Journal Educational Psychology, 1914. 
3 Courtis, S. A.: School and Society, July 14, 1923.
4 Kelley, T. L.: Principles and Technique of Mental Measurement. Ameri- 
can Journal Psychology, July, 1923, p. 418ff. 
smaller the differences between the steps, hence, to secure equivalent
units, the groups of judges must be equivalent. Finally, even with 
scales equivalent as a whole, as the Thorndike and Ayres may be, the 
determination of the points or scores on one which correspond exactly 
to the points or scores on the other is frequently imperfectly done as 
attested by the fact that different methods yield different results.¹
It will be advisable, then, before applying the formulas with confidence
to other scales, to investigate their applicability by methods similar to,
or better than, those used in the present study.

Using the Thorndike or any other scale, the formulas will hold good 
only to the degree that merit or quality in the test-writing is accurately 
appraised. Courtis² and others have shown that individuals with wide
school experience vary in ability to judge handwriting. Inaccuracies
in judging the specimens will be somewhat magnified by the combined
score, which will, however, under such conditions be a better estimate
of general ability than quality alone, since the inaccuracies in quality
will, on the other hand, be tempered by the inclusion of rate. Rate,
to be used in the formula, must be measured in terms of number of
letters per minute.


THEORETICAL IMPLICATIONS 


In attempting empirically to establish the relation of speed and 
quality in the case of writing, many types of combinations were tried 
only to be discarded. For example, the general procedure represented by


Combined score = Q^k + S^t


(in which k and t are constants) was variously tried. But no method
of weighting either variable resulted in a measure of general ability 
which remained satisfactorily constant in different tests. That is, 
the method of adding extra credits for speed to the measure of quality, 
which is commonly used in many types of tests, proved unsuitable for 
writing. 

That the device of multiplying Q and S should satisfy the criterion 
of measuring general ability, would seem to be of considerable impor- 
tance, not only because of its practical value in the present instance, 
but because of its general implication. The successful formula is of 
the form 

General ability = Q^k S^t


1 See T. L. Kelley’s data in “Statistical Method.” New York, Macmillan,
1923, pp. 114-120. 
2 Op. cit. 












It is suggested that in other functions, mental as well as motor, the 
proper score to express all aspects of ability is to be found by search 
for the proper weights of k and t. It is not suggested, of course, that
the exact weighting here used, namely k = 1 and t = 1/3, will be appropriate
to drawing, typing, operating machines, sewing, or to reading,
arithmetic, making business decisions, judging fruit, etc., but that 
some weights for the two variables which are combined by multiplica- 
tion, will constitute the solution. 

In the case of writing, the successful formula suggests, interest- 
ingly, the relative importance of quality and rate of performance in 
determining general ability. The formula shows that quality of 
performance is more important; speed, relatively is of slight signifi- 
cance in comparison with the weights which current methods of scoring 
usually yield. 

It has been observed in Table I that with the change of instructions
speed fluctuates greatly and quality slightly, although, of course, they
are always related: to increase one is to decrease the other. Rate, however,
is more flexible; great changes in pace are possible. Thus as
may be observed in Table I, some individuals may at one time write 
twice as fast as at another; indeed, the average rate for “quality” 
writing is 36.5 and for “speed” 58.9 letters per minute in Grade V.
The fluctuations in quality are much smaller. The effect of the
formula is to reduce tremendously the range (i.e., the weight) of speed.
For example, for Grade V (Scarborough results) the average figures are: 


                          WRITING FOR    WRITING     WRITING FOR
                          QUALITY        NORMALLY    SPEED
Quality ................  10.375         9.41        8.75
Speed ..................  45.7           64.7        73.3
Cube root of speed .....  3.57           4.01        4.23
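
As a rough check on the Grade V averages given above, the short Python sketch below computes Q^k S^t with k = 1 and t = 1/3 for the three kinds of instruction and compares the relative range of that product with the relative ranges of quality and of speed taken alone. The helper names are illustrative assumptions, and the cube roots are recomputed rather than copied from the table.

```python
# Grade V averages (Scarborough results): (quality, letters per minute).
grade_v = {
    "writing for quality": (10.375, 45.7),
    "writing normally": (9.41, 64.7),
    "writing for speed": (8.75, 73.3),
}


def general_ability(quality, speed, k=1.0, t=1.0 / 3.0):
    """Multiplicative score Q**k * S**t with the weights reported for writing."""
    return quality ** k * speed ** t


def relative_range(values):
    """Range divided by the mean: a crude index of how much a measure shifts."""
    values = list(values)
    return (max(values) - min(values)) / (sum(values) / len(values))


qualities = [q for q, _ in grade_v.values()]
speeds = [s for _, s in grade_v.values()]
scores = [general_ability(q, s) for q, s in grade_v.values()]

for name, score in zip(grade_v, scores):
    print(f"{name:<20} combined score {score:.1f}")

print(f"relative range of quality: {relative_range(qualities):.2f}")
print(f"relative range of speed:   {relative_range(speeds):.2f}")
print(f"relative range of product: {relative_range(scores):.2f}")
```

Under these assumptions the product shifts far less from one instruction to another than either quality or speed does, which is the sense in which it "remains satisfactorily constant."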


From widely differing rate scores the formula transforms the results 
to small differences thus greatly reducing the weight of speed on the 
combined score. To use the fourth root would practically eliminate 
speed. In the case of writing, rate still has some weight. In other 
functions, its weight may be greater—the square root, the 2.1th or 
2.6th or 1.6th or some other root—but the suggestion here contained 
is that in most tests, the methods of scoring now commonly utilized, 
such as adding points for each of the number of items done under a 
time limit, or the addition of point credits for speed, have resulted in 
an over-emphasis of speed. The implication is that in mental and 
motor testing, the optimum measure of ability approximates more 
closely the measures of quality of performance and less closely the rate, 
than the results commonly obtained. Not only do the facts in the
case of writing indicate a relatively slight weight for speed in general, 
but they suggest the relative insignificance of very high speed, in 
particular. In other words, the addition of a given amount of speed
(say 10 letters per minute) is not the same everywhere but depends
upon the amount already present to which it is added. As an increment
to a speed of 20 letters per minute, it would be considerable, but
added to 100 letters per minute it would be very slight. This is apparent
when a curve showing the relation between speed and the cube root
of speed is shown (see Fig. 1).

Fig. 1. A curve which shows how the “weights” of numbers are progressively
decreased, relatively, when the cube root is substituted for the number itself. The
difference between 10 and 20 is very great compared to the difference between 100
and 110, which in turn is greater than that between 190 and 200.

The facts here disclosed concerning the influence of increments of rate
in determining total ability bear an interesting resemblance to the facts
of perception summarized in Weber’s Law, which is, in essence, a statement
that the least perceptible difference between two stimuli (weights, tones,
colors, odors, etc.) is a constant fraction of the magnitude of the stimuli.
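
The shrinking weight of equal increments under the cube root, which Fig. 1 is described as showing, can be illustrated numerically. The following is a minimal sketch; the function names are illustrative, and the three starting points are the ones named in the caption.

```python
def cube_root(x: float) -> float:
    return x ** (1.0 / 3.0)


def gain_from_increment(start: float, step: float = 10.0) -> float:
    """Increase in the cube root when `step` is added to `start`."""
    return cube_root(start + step) - cube_root(start)


for start in (10, 100, 190):
    gain = gain_from_increment(start)
    print(f"{start:>3} -> {start + 10:>3}: cube root rises by {gain:.3f}")
```

The same increment of 10 counts for progressively less as the starting value grows, which is the property that keeps very high speeds from dominating the combined score.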

In conclusion, it may be said that the problem of weighting rate 
and quality, where both are concerned in a combined score, cannot be 
avoided although we may, as frequently heretofore, ignore it. No 
matter how one may score, some weight for both quality and speed has 
been adopted. Where one is ignorant of the method of combining the 
two variables, the result is unlikely to be meritorious. It seems, there- 
fore, that one of the important tasks before students of mental 
measurement is the experimental determining of the relation of 
quality and rate of performance in functions for which tests have been, 
or may be, devised. 











RELIABILITY OF SCHOOL TESTS OF AUDITORY 
ACUITY 


HARVEY A. PETERSON 
Illinois State Normal University, Normal, Illinois 


AND 
JEROME G. KUDERNA 


The Lincoln School of Teachers College 


So far as the writers are aware there are no tests of the reliability of 
school tests of auditory acuity. Yet the need of such tests is very great.
Statistics show that the percentage of children examined who are
reported deaf varies from 1 to 50 per cent, due “to lack of uniform
standard as to what constitutes defective hearing, and to lack of
uniformity in methods of testing.”¹ Within the work of a single school
examiner, there are a number of factors which are claimed to make the 
results unreliable. Chief among these claims are the difficulty of 
maintaining a whisper of constant strength, even by experienced 
examiners, and the great variability in the amount of external distrac- 
tions. Where the stimuli are slight, as in the case of the whispered 
speech or the watch tick, nearly all distractions are apt to prove 
serious. 

In view of these facts the writers undertook to test the reliability 
of the whispered speech test and the watch tick test possible under 
ordinary school conditions. 

I. Description of the Tests and of the Conditions. The plan was to
have the same examiners give two whispered speech tests and 
two watch tests to the same group of persons, then to find the coeffi- 
cients of correlation between the first and second tests of the same kind 
and between the two different kinds of tests. 

For subjects 61 normal school students were used. All the tests 
were made in a room 32 feet by 75 feet. Its acoustic properties are 
excellent.²

In the watch test 48 persons were tested by H. A. P. and 13 by
J. G. K. Only one person was tested at a time. The subject kept his


1 Terman, L. M.: The Hygiene of the School Child, p. 220. 

2 Before beginning the tests the examiners underwent together a preliminary
practice in testing a small number of persons to make their procedure uniform. 
Prior to this work H. A. P. had ten years experience in conducting auditory tests 
in the laboratory, and J. G. K. four years. 
eyes closed while being tested. The maximum distance at which he
could hear the watch tick when it was moved away from him and when 
it was moved toward him in half-yard units was found and averaged. 
These averages, obtained separately for the two ears, were his 
measures. When it was thought that the person was reaching his 
limit, the watch was now allowed to run and now stopped, and the 
person had to answer correctly twice out of three times whether he 
heard it or not. There were frequent short rests during each test to 
lessen fatigue. The average duration of a test of one person was from 
six to seven minutes. 

All of the speech tests were given by J. G. K. From 6 to 10 persons
were tested at a time, seated in a curved row from east to west about 20
feet away from the wall. The examiner stood in the aisle opposite the
middle of the row, and 45 feet from it. Three series of 10 numbers
each were given each ear in each of the two trials. The measure of
the hearing of each ear in a trial was the average of the three
percentages of numbers correctly heard in the three series.¹

II. Central Tendencies and Measures of Variability—In Table I 
are given the results of the 61 persons who completed all the watch 
tests, and of the 47 persons who completed all the watch and speech 
tests. The figures in the watch test give the maximum distance in 
yards at which the watch could be heard, those in the speech test, the 
per cents of 30 numbers to each ear which could be heard at the range 
adopted by the examiner. The standard deviations are to be read the 
same way. There is a large improvement in the successive trials of 
the watch test, even between the two ears in the same test, while the 
improvement in the speech test is very slight. The coefficients of 
variability are also more consistent in the speech than in the watch
test.


1 The average amount of time between the first and second watch test was 7.5 
days, AD 4 days; that between the first and second speech tests was 2.5 days, 
AD 1 day. The tests were given in the following order:

(a) First trial of watch test, left ear
(b) Same, right ear
(c) Second trial of watch test, left ear
(d) Same, right ear
(e) First trial of speech test, left ear
(f) Same, right ear
(g) Second trial of speech test, left ear
(h) Same, right ear.
TABLE I. MEDIANS AND VARIABILITIES

Watch Test
                                       Left ear                     Right ear
                               First    Second   Average    First    Second   Average
                               trial    trial               trial    trial
Medians .....................  3.90     4.50     4.20       4.30     4.79     4.55
Standard deviations .........  1.4      [illegible]         1.8      1.5
Coefficient of variability ..  0.36     [illegible]         0.42     0.31

Speech Test
Medians .....................  80.6     81.9     81.3       81.3     82.3     81.8
Standard deviations .........  19.9     18.9                16.5     15.6
Coefficient of variability ..  0.25     [illegible]         0.20     0.19




















III. The Distribution of the Scores—The frequency curves for 
the distribution of scores are omitted for lack of space. They are of 
fair normality but rather peaked, showing that the scores are not
widely dispersed. There is a slight skewing of the watch test curves
toward the low end of the distribution due to large practice effect in a
small number of cases. In the speech test there is a slight skewing
toward the high end of the distribution due to the fact that a stimulus
intensity was used that would make the median fall not lower than 75 to
80 per cent in order to avoid guessing. This increases slightly the
number of high scores (90-95 per cent) hence the slight skewing to the 
right. Table I showed that the coefficients of variability in the watch 
tests are not only less consistent than in the speech tests, but that in 
every case they are larger. The latter indicates that the relative 
dispersion in the watch test is greater than in the speech test. 
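
The comparison of relative dispersion can be checked against the legible entries of Table I. The sketch below assumes that the coefficient of variability was obtained by dividing the standard deviation by the median; the text does not state this, but the assumption reproduces the tabled coefficients to two decimals. The dictionary layout and names are illustrative.

```python
# Legible (median, standard deviation, tabled coefficient) entries from Table I.
table_i = {
    "watch, left ear, first trial": (3.90, 1.4, 0.36),
    "watch, right ear, first trial": (4.30, 1.8, 0.42),
    "watch, right ear, second trial": (4.79, 1.5, 0.31),
    "speech, left ear, first trial": (80.6, 19.9, 0.25),
    "speech, right ear, first trial": (81.3, 16.5, 0.20),
    "speech, right ear, second trial": (82.3, 15.6, 0.19),
}


def coefficient_of_variability(median: float, sd: float) -> float:
    """Relative dispersion, assumed here to be SD divided by the median."""
    return sd / median


computed = {
    label: coefficient_of_variability(median, sd)
    for label, (median, sd, _) in table_i.items()
}

for label, value in computed.items():
    print(f"{label:<32} {value:.2f} (tabled {table_i[label][2]:.2f})")
```

Because the two tests are scored in different units (yards and per cents), only a unit-free ratio of this kind allows their dispersions to be compared.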

IV. The Reliability of the Test Scores. (a) Difference between 
Reliability and Validity.—By validity is meant the extent to which 
the test measures that which it is claimed to measure, i.e., auditory
acuity for classroom stimuli, whereas by reliability is meant the extent
to which the relative scores of the subjects are the same on every
application of the test, i.e., how well the test measures that which it
really does measure.¹





1 Kelley, T. L.: The Reliability of Test Scores. Journal of Educational
Research, 1921, Vol. III, p. 370. 
(b) Reliability of the Speech Test and the Watch Test. The reliability
coefficients between two applications of each of the two tests of
hearing for each ear, designated by r12 and rcc for the watch test
and the speech test, respectively, in Table II were calculated by the
product-moment formula. In all cases the correlations were calculated
between the two series of absolute scores.¹ The speech tests marked
H. A. P. were carried out at a later time in connection with a study of
the effect of practice, but under the same conditions as the speech
tests by J. G. K.²


TABLE II. RELIABILITY COEFFICIENTS OF WATCH AND SPEECH TESTS

Watch Test
                 r12             r1'2'          √r12
Left ear ......  0.58 ± 0.05     0.73           0.76        f = 50
Right ear .....  0.64 ± 0.05     0.78           0.80

Speech Test (J. G. K.)
                 rcc             rc'c'          √rcc
Left ear ......  0.79 ± 0.04     0.88           0.89        f = 61
Right ear .....  [illegible]     [illegible]    0.88

Speech Test (H. A. P.)
                 rcc
Left ear ......  0.76 ± 0.04
Right ear .....  0.78 [probable error illegible]





1 Criticism may be raised against this procedure. However, the self-corre- 
lations between the absolute scores of a trial and the same scores expressed on a 
group percentage basis were LE 0.90 ± 0.03 and RE 0.92 ± 0.03. The
standard deviations in the two sets of correlates differed by 2.1 per cent and 1.5 
per cent. Hence the results would be much the same whichever set of data is 
used. Moreover, the possibility of a spurious correlation which arises from the use 
of ratios for measurements is eliminated by correlating the absolute measures, as 
was done above. 

2 See p. 145. 
The reliability coefficients representing the correlation between
two applications of a test combined, as found by Brown’s formula,¹
are shown under r1'2' for the watch test and rc'c' for the speech test
in Table II.

These are the usual reliability coefficients. But the extent to 
which the scores determined by a test would correlate with the true 
scores of the subjects is probably an even more significant index of 
reliability. These true scores may be defined as the average scores of 
individuals upon a very large number (an infinite number) of just 
such tests. The correlation between a single test and a true score in 
the trait measured by a single test is equal to 


$$\lim_{n \to \infty} \frac{n\,r_{12}}{\sqrt{n + n(n-1)\,r_{12}}} = \sqrt{r_{12}}$$








This is, then, the maximum correlation which can ever be secured 
(except as a matter of chance) with the test under the conditions 
given. 

The values of √rcc for the speech test (0.88 and 0.89) show that a
speech test of the length used ought to be adequate to secure satis- 
factorily consistent measures of this type of auditory acuity, since a 
greater index of reliability than of this magnitude would necessitate 
a refinement of measurement inconsistent with the limitations of the 
experimental procedure. 
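
The two derived columns of Table II can be recomputed from the raw reliability coefficients. The sketch below assumes that "Brown's formula" is the familiar expression for a test of doubled length, 2r / (1 + r), and that the index of reliability is the square root of the raw coefficient; both assumptions reproduce the legible tabled values to two decimals. The function names are illustrative.

```python
import math


def brown_doubled(r: float) -> float:
    """Reliability of two applications combined (Brown's formula, assumed form)."""
    return 2 * r / (1 + r)


def index_of_reliability(r: float) -> float:
    """Correlation of a single test with the true score, taken as sqrt(r)."""
    return math.sqrt(r)


# Raw coefficients legible in Table II.
raw = {
    "watch, left ear": 0.58,
    "watch, right ear": 0.64,
    "speech (J. G. K.), left ear": 0.79,
}

for label, r in raw.items():
    print(f"{label:<28} r = {r:.2f}  "
          f"combined = {brown_doubled(r):.2f}  "
          f"index = {index_of_reliability(r):.2f}")
```

Both derived figures are always at least as large as the raw coefficient for r between 0 and 1, so they should be compared only with figures of the same kind.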

Another consideration that indicates the satisfactory reliability 
of the speech test is that not only are the reliability coefficients high 
but remarkably consistent. This is brought out in a more striking 
way by a discussion of the results of the speech test (H. A. P.) in 
Table II. For the purpose of determining the practice effect (to be 
discussed later), experimenter H. A. P. gave the speech test to a group 
of 51 subjects under conditions corresponding to those that prevailed 
in the speech test given by J. G. K. with the first group of 61 subjects. 
It will be noted that the four values of rcc are found within a range of
0.03 which represents a remarkable consistency in correlation coeffi- 
cients. Since the standard deviations of the two groups are the same 
within the limits of the probable error, this comparison is valid. 

The reliability of the coefficients as evidenced by their PE is 
satisfactory according to accepted standards. 


1 Brown, W.: Essentials of Mental Measurement. 
2 Kelley, T. L.: Ibid., p. 372.
Another measure of reliability of a test, independent of the distribution
of the group, is the standard error of estimate expressed in
terms of the standard deviation. Its value for the speech test was 
0.44; for the watch test 0.64. This evidence points to the greater 
reliability of the speech test. 

(c) Reliability of the Watch Test.—The reliability of the watch test 
cannot be considered satisfactory, particularly when it is remembered 
that the ultimate aim is individual measurement. It seems to be the
consensus of judgment among investigators of correlation data that
a correlation of 0.70 (raw correlation) may be very significant when
different traits are being correlated, but correlations of less than that
value for the same trait on different days lead one to suspect the
accuracy of a single trial of the test (see Table II).

Not only are the reliability coefficients of the watch test lower than 
those of the speech test but apparently they are less consistent. It 
might be of interest to determine to what extent the watch test would 
have to be lengthened to produce results as reliable as those of the 
speech test (0.78). Employing Brown’s formula, the results show that 
the watch test would have to be extended from two to two and one- 
half times its present length. 
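
The lengthening estimate can be reproduced by solving the general form of Brown's formula for the number of test lengths. The sketch below assumes the general form r_n = n r / (1 + (n - 1) r), of which the doubled-length expression is the case n = 2; the function names are illustrative.

```python
def brown_lengthened(r: float, n: float) -> float:
    """Reliability of a test made n times as long (general Brown formula)."""
    return n * r / (1 + (n - 1) * r)


def lengths_needed(r: float, target: float) -> float:
    """Multiple of the present length needed to raise reliability r to target."""
    return target * (1 - r) / (r * (1 - target))


speech_reliability = 0.78
for ear, r in (("left ear", 0.58), ("right ear", 0.64)):
    n = lengths_needed(r, speech_reliability)
    print(f"watch test, {ear}: about {n:.2f} times the present length")
```

With the watch-test coefficients of Table II this gives roughly 2.6 and 2.0, in line with the statement that the test would have to be extended from two to two and one-half times its present length.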

V. Causes of Greater Reliability of the Speech Test. Practice Effect. 
The medians given in Table I show that there was a constant im- 
provement in the successive trials of the watch test. These medians 
were, in the order in which the trials were given: 1 LE, 3.9 yd.; 1 RE,
4.3 yd.; 2 LE, 4.5 yd.; 2 RE, 4.79 yd. The improvement is widely 
distributed among the individuals in the groups. The trials of the 
speech test show almost no improvement, the medians being, in the 
order in which the trials occurred: 80.6 per cent, 81.3 per cent, 81.9 
per cent, 82.3 per cent. 

The slight improvement in the speech test was thought perhaps to 
be due to the fact that it was given after the watch test. Practice 
with the watch test might have habituated the subjects to the condi- 
tions common to the two tests. Accordingly, two complete speech 
tests alone were given to a group of 51 students who had never been 
tested before. The conduct of the tests was the same as in the earlier 
series, except that H. A. P. did the testing. Preliminary practice
consisted in giving the usual two series of 10 numbers each to the left 
ear immediately before beginning the first test. The persons were 
tested in four groups. The results are given in Table III. 
TABLE III. PRACTICE EFFECT IN SPEECH TESTS

                   1st LE            1st RE            2nd LE            2nd RE
Series .......  1   2   3   Av.   1   2   3   Av.   1   2   3   Av.   1   2   3   Av.
Averages .....  82  84  83  83    80  79  80  80    84  81  87  84    82  83  79  81
Medians ......  85  88  88  84    85  83  85  82    87  85  92  88    88  89  85  85





















































The average scores of the second tests are only 1 per cent greater 
than those of the first tests (LE, 84 per cent and 83 per cent and RE, 
81 per cent and 80 per cent respectively). In LE, 24 persons im- 
proved an average of 10 per cent, 19 deteriorated an average of 8 per 
cent and 8 were stationary. In RE, 31 made an average improve- 
ment of 10 per cent; 18 deteriorated an average of 18 per cent; and 2 
were stationary. In the medians the improvement is only slightly 
greater. Hence we conclude that the small improvement in the speech 
tests of the original series (J. G. K.) was not due to the fact that they 
were given after the watch tests. 

Examination of the two practice series preceding the regular tests 
gave some ground for the belief that there was improvement due to 
practice in the speech test, but that it was short-lived and was about over 
by the end of 1 LE. However, these practice series were given partly 
for the benefit of the tester to allow him to standardize the whisper 
used, so it was impossible to tell whether the increases in the scores in 
the practice series were real improvements only, or due partly to the 
tester increasing the strength of the whisper, or to both. To decide 
the question, another group of 26 students was taken and divided into 
two groups, one of 12 and the other of 14. Each group was given 
(by H. A. P.) five series of 10 numbers each to the left ear without 
any preliminary practice whatever for the subjects. The second group 
was tested immediately after the first, as it was thought that if there 
was any tendency for the tester unconsciously to increase the strength 
of the stimulus, it would be over by the end of the work with the first 
group, and if the improvement in the scores occurred with both groups 
to about the same extent, it would be due to practice on the part of the 
subjects, and not to a change in the stimulus. The results are given 
in Table IV. It is clear that there is improvement in the speech test and 
that it is due to practice, but that it is short lived, and runs its course during 
the five series. These five series correspond to the two practice series 
and the first regular test to the left ear in the earlier tests conducted
by J. G. K. Both groups start at about 74 per cent, reach their
maximum in the next three series, and make no further advance in the
fifth series. Proof has already been given in Table I that there is
very little improvement after the completion of 1 LE.


TABLE IV. SHOWING THE EXISTENCE AND BRIEF DURATION OF PRACTICE EFFECT
IN THE SPEECH TEST

Left Ear
                    Group 1                              Group 2
Series .......  1    2    3    4    5     Series .....  1    2              3    4    5
Averages .....  74   78   79   95   87    Average ....  73   [illegible]    84   85   83














But why does the watch test show so much improvement due to 
practice? At first it is difficult to distinguish the watch tick, beyond 
the first yard or two, from other faint sounds—the movements of air 
through the hot air register, the creaking of the windows, the whirr of 
distant lawn mowers or the faint chirping of birds. The final identi- 
fication of the tick made possible a large improvement. In the speech 
test this difficulty was not nearly so acute. The numbers were easily 
recognized as numbers if they could be heard at all. They were the 
only animate sound. Again the watch test is a varied range test. It 
encourages indefinite improvement by lengthening the distance; while 
in the speech test, with the median person hearing 80 per cent of the 
numbers spoken, this was not true, 95 per cent being approximately 
the best record obtainable. 

Now, analysis of the individual gains and losses in the watch tests 
shows that individuals differ widely in their ability to profit by practice. 
The complete distributions of improvement, positive and negative, 
approximate roughly the normal curve. 


TABLE V. DISTRIBUTIONS OF IMPROVEMENT OF SECOND WATCH TEST OVER FIRST


Pla hdinnS wh pie dens te eapens -3 -2 -1 12 3465 67 8 
EE et ao arene ee 1 1 9 2 1433321021 
De nctieh ceca eceghes.cucans 2 2 2 9 13 18 8 5 2 


The effect of all these facts is inevitably to lower the reliability 
coefficients in the watch test. For if some individuals gain greatly 
and others little and the gain is not symmetrically distributed over 
the range, there will be much shifting in the relative positions of the
individuals from first to second trials, and this necessarily lowers the 
correlation between the two trials. On the other hand the speech 
test, because there is only slight improvement in it, between two suc- 
cessive trials, shows a higher correlation. 
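
The argument that unequal practice gains lower the correlation between trials can be illustrated with a small simulation. Everything in the sketch below is hypothetical: the scores are random numbers, not the study's data, and the means and spreads were chosen only to make the effect visible.

```python
import random


def pearson(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the illustration is repeatable
n_subjects = 2000

true_acuity = [rng.gauss(4.0, 1.0) for _ in range(n_subjects)]
first_trial = [t + rng.gauss(0.0, 0.5) for t in true_acuity]
second_error = [rng.gauss(0.0, 0.5) for _ in range(n_subjects)]

# Case 1: every subject gains the same amount from practice.
second_equal_gain = [t + e + 0.5 for t, e in zip(true_acuity, second_error)]
# Case 2: gains differ widely from subject to subject.
second_unequal_gain = [
    t + e + rng.gauss(0.5, 1.0) for t, e in zip(true_acuity, second_error)
]

r_equal = pearson(first_trial, second_equal_gain)
r_unequal = pearson(first_trial, second_unequal_gain)
print(f"equal gains:   r = {r_equal:.2f}")
print(f"unequal gains: r = {r_unequal:.2f}")
```

A uniform gain shifts every score alike and leaves relative positions untouched, whereas gains that vary from person to person reshuffle the order and so depress the coefficient.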

VI. Correlation between Watch and Speech Tests. In order to
study the correspondence between the watch and speech tests of
hearing, the coefficients of correlation (product-moment) were calculated
between the first trial of the speech test and the first trial of the watch
test, between the second trial of the speech test and the second trial
of the watch test, and between the averages of the two trials for
each test. These results are recorded in Table VI, under r12.¹


TABLE VI. CORRELATION BETWEEN WATCH AND SPEECH TESTS OF HEARING

                                   Left ear (f = 47)    Right ear (f = 47)
                                   r12                  r12
First trials ...................   0.53 ± 0.07          0.53 ± 0.07
Second trials ..................   0.50 ± 0.07          0.51 ± 0.07
Average of first with average
  of second ....................   0.54 ± 0.07          0.55 ± 0.07


The correlation of 0.50 to 0.55 is considerable, but not as high as 
might be expected. The reliability of the coefficients is established by 
the fact that they are about seven to eight times as large as their 
PE. Moreover, the remarkable consistency between the six values 
of r adds to the thorough reliability of the coefficients. 
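
The probable errors in Table VI, and the ratio of each coefficient to its probable error, can be recomputed under the classical formula PE = 0.6745 (1 - r^2) / sqrt(n). The text does not say which formula was used, so this is an assumption, though it reproduces the tabled value of 0.07 for every entry with n = 47.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Classical probable error of a product-moment correlation (assumed formula)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)


n_subjects = 47
table_vi = {
    "left ear, first trials": 0.53,
    "left ear, second trials": 0.50,
    "left ear, averages": 0.54,
    "right ear, first trials": 0.53,
    "right ear, second trials": 0.51,
    "right ear, averages": 0.55,
}

ratios = {}
for label, r in table_vi.items():
    pe = probable_error_of_r(r, n_subjects)
    ratios[label] = r / pe
    print(f"{label:<26} r = {r:.2f}  PE = {pe:.3f}  r / PE = {ratios[label]:.1f}")
```

The ratios fall between roughly 6.8 and 8.0, consistent with the statement that the coefficients are about seven to eight times their probable errors.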

VII. Causes of Imperfect Reliability of Both Tests—We here under- 
take to enumerate the chief causes of the imperfect reliability of both 
tests. Foremost among the interfering factors is that of practice, 
fully discussed above. 

In spite of the care exercised in eliminating the data of subjects 
suffering with colds, their bodily tone necessarily varied from day to 
day; thus the longer the time interval between tests, the lower the 
correlation. In this respect, the watch test suffered more than the 
speech test (average interval between the two trials of watch test 
7.5 days; speech test 2.5 days). 

Extraneous noises (see p. 152) although controlled insofar as 
possible—with the greatest of care—either by elimination or dis- 
continuance of the testing, nevertheless proved another source of 


1 Application of Blakeman’s practical criterion for linearity shows that the 
lines of the means in the two tests are true linear regressions, and hence that it is 
permissible to use the product moment formula in calculating the coefficients of 
correlation (Blakeman, J.: Biometrika, Vol. IV, p. 349). 
attenuation. It was difficult to estimate this interference, for inquiry
of the subjects introduced the dangerous possibility of suggestion. 

Because of the complexity of the speech stimulus, its variability in 
intensity, pitch, and timbre, and the necessity of its interpretation as 
particular numbers by the subject, it is possible for subjects in the 
lower range of acuity to get some score by the watch test when a score 
of zero is recorded for the speech test. In the first place this would 
tend to lower the correlation between the two tests; second, the range 
of possible displacement of an individual score from one trial to the 
next could be greater in the watch test than in the speech test and 
hence lower the reliability coefficient of the former more than the 
latter. 

Investigations have shown that within closed rooms the intensity 
of the sound does not vary inversely as the square of the distance in a 
uniform manner. This would produce unequal practice effects in 
different parts of the room and hence lower the correlation between 
the two trials of the watch test.! 

Among the subjective factors, the whole complex of zeal involving 
incentive, attention, and fatigue presents a difficult problem. The 
inability of the unpracticed subject to pay strict attention, resulting 
in fluctuations, the changes in the vividness or clearness of auditory 
sensations due to these changes in attention, the difficulty of locating 
the threshold of sensation or the point at which the stimulus becomes 
strong enough to arouse sensations are all factors that probably affect 
the results of the watch test to a greater extent than the speech test. 

In view of these and other possible disturbing factors it appears 
that the coefficients fall considerably below the true values, especially 
in the case of the watch test. 

VIII. Validity of the Watch Test—It may be well to examine the 
validity of the watch test. Kelley states: “If a measure correlates 
very highly with known measures of capacity, it must of necessity 
have a fair degree of reliability, but as the converse is not true—that 
if a test has high reliability, it will correlate well with a valid criterion 
—correlation with a good criterion should be used as a measure of 
validity and not of reliability.”² The difficulty that arises in this
investigation is that a valid criterion of auditory acuity has not yet
been developed. However, in view of the fact that whispered numbers


1 Andrews, B. R.: American Journal of Psychology, 1904, 15, 14-56, and 1905, 
Vol. XVI, pp. 302-26. 
2 See Kelley, ibid., p. 371.
resemble the speech of the schoolroom, to which may be added the
high reliability of the speech test and the consistency of the reliability
coefficients, it seems reasonable to accept, tentatively at least, the
speech test as a criterion. Since this criterion is not one of perfect
reliability, in order to be able better to understand the similarity
between the functions measured in the watch test and those in
the criterion, it is necessary to know not only the correlation with
the criterion, but also the reliability of the test and the reliability of the
criterion. In Table II, $r_{11}$ is the reliability coefficient of the watch
test and $r_{cc}$ is the reliability coefficient of the assumed
criterion (the speech test); then $\sqrt{r_{cc}}$ is the maximum possible
correlation with it (except as a matter of chance). The correlation between
the watch test and the criterion is $r_{1c}$. In such a case as this where the
criterion is not of perfect reliability “the maximum value which $r_{1c}$
can have (this maximum is reached when no elements except chance
other than those involved in the criterion are involved in the test score)
is $\sqrt{r_{11}} \times \sqrt{r_{cc}}$.” When this is true “the best results possible of
anticipation are obtained; that is, all the factors of which the test
score is a measure are either (1) factors likewise measured by the
criterion or (2) chance factors.” Turning to the data, it is seen that
for the left ear the $r_{1c}$ of the average of the tests is 0.54 while
$\sqrt{r_{11}} \times \sqrt{r_{cc}}$ is 0.68. For the right ear, $r_{1c} = 0.55$ and
$\sqrt{r_{11}} \times \sqrt{r_{cc}} = 0.70$.
The fact that $r_{1c}$ is considerably less than $\sqrt{r_{11}} \times \sqrt{r_{cc}}$ indicates that
some of the functions of auditory acuity measured by the watch test
are not identical with functions involved in the speech test. Here,
then, we have statistical evidence that the two tests do not measure
in their entirety the same phases of auditory acuity.
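
The ceiling quoted from Kelley can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: the function names are hypothetical, and the reliabilities passed to the first function in the usage lines are illustrative assumptions. Only the observed correlations (0.54, 0.55) and the ceilings (0.68, 0.70) are taken from the text.

```python
import math


def correlation_ceiling(test_reliability: float, criterion_reliability: float) -> float:
    """Largest correlation a test can show with an imperfectly reliable criterion.

    Implements the bound sqrt(test reliability) * sqrt(criterion reliability).
    """
    for value in (test_reliability, criterion_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(test_reliability) * math.sqrt(criterion_reliability)


def share_of_ceiling(observed_r: float, ceiling: float) -> float:
    """Observed correlation expressed as a fraction of its ceiling."""
    return observed_r / ceiling


# Figures reported in the text: observed r and ceiling for each ear.
left_share = share_of_ceiling(0.54, 0.68)
right_share = share_of_ceiling(0.55, 0.70)

# Illustrative only: a hypothetical test reliability of 0.50 paired with
# a criterion reliability of 0.88.
example_ceiling = correlation_ceiling(0.50, 0.88)
```

On the reported figures the observed correlation reaches roughly four-fifths of its ceiling for either ear, which is the shortfall the paragraph interprets as evidence of non-identical functions.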

IX. Quartile Retention—Inasmuch as the ultimate aim of the
tests is the location of subjects defective in hearing, the distribu- 
tions of the two trials of the watch and speech tests for each ear were 
arranged in quartile tables. 


TABLE VII.—QUARTILE RETENTION (PER CENT)

Test and ear        Total quartile retention    Retention in lowest quartile
Watch test, LE                 36                            40
Watch test, RE                 39                            56
Speech test, LE                40                            60
Speech test, RE               [?]2                           59


The above table indicates that about four subjects in ten are 
likely to be placed in their proper quartile as the result of a single 
testing. However, in the lowest quartile (the significant quartile)
the best showing (about 6 in 10) is made by the speech test. Moreover,
in the watch test, the retention in the lowest quartile is greater
for the right ear than the corresponding retention for the left ear, 
largely because of the diminishing practice effect. 
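
The text does not spell out how quartile retention was computed. The sketch below assumes that "retention" means the percentage of subjects who fall in the same quartile on both trials, and that quartile 1 holds the lowest scores. Both are assumptions, as are the function names and the eight made-up scores.

```python
def quartile_labels(scores):
    """Assign each subject a quartile (1 = lowest scores) by rank position."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # subject indices, lowest first
    labels = [0] * n
    for position, subject in enumerate(order):
        labels[subject] = position * 4 // n + 1
    return labels


def quartile_retention(trial_one, trial_two):
    """Return (total retention %, lowest-quartile retention %) for two trials."""
    first = quartile_labels(trial_one)
    second = quartile_labels(trial_two)
    kept = [a == b for a, b in zip(first, second)]
    total = 100.0 * sum(kept) / len(kept)
    # Subjects who started in the lowest quartile on the first trial.
    lowest = [k for k, a in zip(kept, first) if a == 1]
    lowest_pct = 100.0 * sum(lowest) / len(lowest)
    return total, lowest_pct


# Hypothetical scores for eight subjects on two trials.
trial_one = [1, 2, 3, 4, 5, 6, 7, 8]
trial_two = [2, 1, 3, 5, 4, 6, 8, 7]
total_pct, lowest_pct = quartile_retention(trial_one, trial_two)
```

Ties are broken by the order in which subjects are listed, which is a simplification; a study with many tied scores would need an explicit rule.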

In conclusion, in spite of the satisfactory reliability of the speech
test, these facts make it questionable whether a single trial of either test,
especially the watch test, is sufficient in locating specifically the individuals
in the quartile where the seriously deaf are to be found. The correlation
of 0.88 is sufficiently low to make it possible for an individual to shift
from the upper end of the fourth quartile to the lower end of the third,
or vice versa; although such a displacement may be actually small, it
removes an individual from the seriously deaf quartile to one of the
others. Hence, undoubtedly, those of the lowest quartile should be
retested before professional examination is recommended.

Conclusions.—1. Perhaps the outstanding result of this investigation
is the established greater reliability of the speech test as compared
with the watch test, provided the former is given by a trained
examiner. This is significant inasmuch as the fundamental criticism
of the speech test has been directed against the variability of the
stimulus.

2. This smaller reliability of the watch test has been found to be 
due mainly to the large practice effect in it. 

3. In general, a single trial by the speech test of the lengths used 
(60 numbers), in view of its reliability of 0.88, is adequate for the 
preliminary location of individuals with subnormal hearing. How- 
ever, before recommending professional examination, the lowest 
quartile should be retested. 

4. A single testing by the watch test is not reliable. In order 
to secure as satisfactory a result as in the speech test, from two to three 
trials would be necessary, involving an expenditure of about six times 
as much time as used in the speech test. 

5. The two tests of hearing do not test identical complexes of 
auditory acuity as evidenced by the criterion formula. 

6. Repeated testings fail to reveal any difference in the auditory
acuity of the two ears. 

7. There is a moderate correlation between the watch and speech
tests, 0.50 to 0.55 ± 0.07.
AN EMPIRICAL STUDY OF THE VARIOUS
METHODS OF COMBINING INCOMPLETE 
ORDER OF MERIT RATINGS 


HENRY E. GARRETT 


Columbia University 


In any study in which individuals or things are to be ranked in order 
of merit for a given trait or attribute possessed in varying degree, the 
question often arises of what to do with those lists which do not include 
all the persons or things to be rated. Suppose, for example, that there 
are 15 individuals to be ranked in order of merit for a given capacity, e.g., 
salesmanship, intelligence, or honesty. There are, let us say, 10 
judges. One judge may know the 15 well enough to rank them all; 
another judge may feel qualified to rank only 5 or 6; another 9 or 10, 
etc. It at once becomes important to know whether the information 
derived from these incomplete rankings can be utilized to give a 
complete and fair order of merit for the entire group. 

The first and the simplest procedure which suggests itself is to aver- 
age the rankings just as is done with complete lists; that is, to add the 
ranks assigned to any individual or thing and divide the sum by the 
number of judges, using only those judges in whose lists the particular 
individual is found. The final order of merit can then be made up on
the basis of these averages. This method takes no account of the fact 
that a rank of 3 in a list of 15, means a quite different thing from a rank 
of 3 in a list of 3. That is, it gives no more credit to the person who 
ranks first in a list of 25, than to the one who ranks first in a list of 5; 
both are merely rated as of the first rank. 
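
As a minimal sketch of this first procedure (an illustration, not the author's own worksheet), each partial list can be held as a mapping from item to rank. The function name and the three toy lists are assumptions made for the example.

```python
from collections import defaultdict


def average_rank_order(partial_lists):
    """Combine incomplete rankings by averaging the ranks each item received.

    Each element of partial_lists maps item -> rank (1 = best). An item's
    sum of ranks is divided only by the number of judges who ranked it.
    Returns (items ordered from best to worst, mean rank per item).
    """
    rank_sums = defaultdict(float)
    judge_counts = defaultdict(int)
    for ranking in partial_lists:
        for item, rank in ranking.items():
            rank_sums[item] += rank
            judge_counts[item] += 1
    means = {item: rank_sums[item] / judge_counts[item] for item in rank_sums}
    # Smaller mean rank means higher merit; the item name breaks exact ties.
    order = sorted(means, key=lambda item: (means[item], str(item)))
    return order, means


# Three hypothetical judges; only the first ranks all three items.
judges = [
    {"A": 1, "B": 2, "C": 3},
    {"B": 1, "C": 2},
    {"A": 1, "C": 2},
]
final_order, mean_ranks = average_rank_order(judges)
```

Item B's rank of 1 in a two-item list counts exactly as much as A's rank of 1 in the three-item list, which is the weakness the paragraph points out.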

Two methods have been devised to meet this difficulty. Thorndike¹
suggests that the first step in combining incomplete lists is “to
get a rough approximation to the true order by inspection, or by
computing the median position for each person, or otherwise.” The next
step is to compare each person rated with his neighbor in terms of the 
percentage of judges (using only those who rated both) who rated the 
first as lower than the second. This percentage can then be converted 
into Median Deviation (PE) units, on the assumption that the varia- 
bility of the opinions of the judges on any individual is approximately 


1 The Technique of Combining Incomplete Judgments of the Relative
Positions of N Facts Made by N Judges. Journal of Philosophy, Psychology
and Scientific Methods, 1916, Vol. XIII, 197-204.
that of the normal curve. Thorndike gives a table in which any division
of opinion among 2 to 15 judges may be expressed in terms of the
Median Deviation. As a final step, each person is compared with 2 
or 3 next neighbors, the differences averaged, and the final order of 
merit drawn up. Any necessary changes in position resulting from 
this last step can then be made. Thorndike illustrates his method by 
combining the judgments of general intelligence made on 34 freshmen 
by 18 individuals who knew from 14 to 29 of the 34, and who rated 
only those whom they knew. 
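
Thorndike's table is not reproduced here, but the conversion it tabulates can be sketched under the normality assumption the text states. The sketch below assumes the usual definition of the Median Deviation (PE) as 0.6745 of a standard deviation. The function name is hypothetical, and its output is not claimed to match Thorndike's printed entries.

```python
from statistics import NormalDist

PE_PER_SD = 0.6745  # one Median Deviation (PE) expressed in SD units


def proportion_to_pe_units(proportion: float) -> float:
    """Convert a division of opinion into a distance in PE units.

    proportion is the share of judges (who rated both persons) placing the
    first person below the second; it must lie strictly between 0 and 1,
    since a unanimous vote has no finite normal-curve equivalent.
    """
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must be strictly between 0 and 1")
    return NormalDist().inv_cdf(proportion) / PE_PER_SD


even_split = proportion_to_pe_units(0.50)      # judges evenly divided
three_to_one = proportion_to_pe_units(0.75)    # three judges in four agree
```

An even split gives a distance of zero, and a three-to-one split gives one PE, which follows from the definition of the PE as the deviation that half of the cases exceed.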

M. J. Ream,¹ who reports the above method in practice extremely
“laborious” and often “productive of contradictory results,” has
suggested a simpler method which is really a modification of the Thorndike
method of comparison. The method proposed by Ream makes it
possible to secure a final order by assuming a normal distribution in
each partial list of those rated. The individuals ranked by any one
judge are given rank-position values in SD terms according to the
number of persons in the list. Thus the person ranked first in a list of
15 would be assigned an SD rank of 1.92, the average distance from
the average in SD units of the upper 6⅔ per cent of the group. The
person ranked fifth in the same group would be given an SD rank of
.53, the average distance from the average of the fifth 6⅔ per cent
of the distribution, counting off the 27 per cent already used up from 
the upper end of the curve. In a list of 5, the individual ranked first 
would be given an SD rank of 1.40, the average distance from the 
average of the upper 20 per cent of the distribution; the next person 
would receive an SD rank of .53, the average value of the next 20 
per cent from the top, and so on for the other three. As all SD values 
are taken from the average, those individuals below the average receive 
negative SD values. The SD values corresponding to any given percentage
of the cases may be read directly from Table XXII of Thorndike’s
“Mental and Social Measurements.”
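
The SD values described above can be approximated without the printed table. The sketch below assumes that each rank in a list of n occupies an equal slice (1/n) of a unit normal curve and takes the mean of that slice. The function names are hypothetical, and this computes the values directly rather than transcribing Thorndike's Table XXII, so small differences from the quoted figures are to be expected.

```python
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()  # mean 0, SD 1


def _density_at_quantile(p: float) -> float:
    """Normal density at the point below which a proportion p of cases lies."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # the density vanishes in the extreme tails
    return _UNIT_NORMAL.pdf(_UNIT_NORMAL.inv_cdf(p))


def sd_rank(rank: int, list_length: int) -> float:
    """Mean SD position of the slice of the normal curve held by one rank.

    Rank 1 is the best. The mean of a unit normal between two cut points is
    (density at lower cut - density at upper cut) / (area of the slice),
    and the area of every slice here is 1 / list_length.
    """
    if not 1 <= rank <= list_length:
        raise ValueError("rank must lie between 1 and list_length")
    p_upper = (list_length - rank + 1) / list_length  # area below the slice's top
    p_lower = (list_length - rank) / list_length      # area below the slice's bottom
    return (_density_at_quantile(p_lower) - _density_at_quantile(p_upper)) * list_length


first_of_five = sd_rank(1, 5)
second_of_five = sd_rank(2, 5)
first_of_fifteen = sd_rank(1, 15)
fifth_of_fifteen = sd_rank(5, 15)
```

The results come out within a few hundredths of the values quoted in the text (1.40 and .53 for a list of 5; 1.92 and .53 for a list of 15), and ranks below the middle of a list receive negative values, as the paragraph states.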

In criticism of the SD method it may be said that the assumption 
of a normal distribution when 5 or 6 persons or things are rated is 
justifiable only in the absence of a more valid assumption. The work 
involved in combining partial judgment lists by this method, while 
very much less than that required by the comparison method, is con- 
siderably greater than the work of merely averaging the ranks in the 
incomplete lists for the final positions. 


1 A Statistical Method for Incomplete Order of Merit Ratings. Journal of Applied
Psychology, 1921, Vol. V, 261-266.

In addition to the three methods outlined above, a fourth, the 
Percentile Method, may be used. In this method each person or thing 
rated is given a percentile rank according to the length of the list and 
his position in it. A person ranked third in a list of 5 would, there- 
fore, receive a percentile rank of 50; the person ranked first would 
receive a rank of 10 (the midpoint value of the first 20 units); while the 
person ranked last—fifth—would receive a percentile rank of 90 (the 
midpoint of the last 20 units—80 to 100). After converting each 
list into percentiles, the percentile values given each of the individuals 
rated are averaged, and the final order of merit compiled. This method
is relatively simple and makes no assumptions of normal distributions.
At the same time it “weights” the ranks according to position and
length of list fully as effectively as the SD method. 
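
The midpoint rule described above reduces to one line of arithmetic. The following sketch uses hypothetical function names and two made-up partial lists; only the worked values for a list of 5 (10, 50, and 90) come from the text.

```python
def percentile_rank(rank: float, list_length: int) -> float:
    """Midpoint percentile of a rank: each rank spans 100 / list_length units."""
    if not 1 <= rank <= list_length:
        raise ValueError("rank must lie between 1 and list_length")
    return 100.0 * (rank - 0.5) / list_length


def combine_by_percentile(partial_lists):
    """Average each item's percentile ranks over the lists that include it.

    Each element of partial_lists maps item -> rank (1 = best). Returns the
    items ordered from best (lowest mean percentile) to worst.
    """
    totals, counts = {}, {}
    for ranking in partial_lists:
        length = len(ranking)
        for item, rank in ranking.items():
            totals[item] = totals.get(item, 0.0) + percentile_rank(rank, length)
            counts[item] = counts.get(item, 0) + 1
    means = {item: totals[item] / counts[item] for item in totals}
    return sorted(means, key=lambda item: (means[item], str(item)))


# Values worked in the text for a list of 5.
first, third, last = percentile_rank(1, 5), percentile_rank(3, 5), percentile_rank(5, 5)

# Two hypothetical partial lists of unequal length.
order = combine_by_percentile([{"A": 1, "B": 2, "C": 3}, {"A": 1, "C": 2}])
```

Unlike plain averaging, a first place in a list of 2 (percentile 25) now counts for less than a first place in a list of 3 (percentile about 17).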

The difficulty (which is perfectly obvious) in testing the reliability 
of any one or all of these methods is the absence of a criterion in the 
form of a complete order from complete lists (or otherwise determined) 
with which the final order of merit as derived from the incomplete lists 
may be compared. In the absence of some such criterion it is impos- 
sible to say which of the methods is the most accurate. 

In the three experiments to be described in this paper, the writer 
has avoided this difficulty of a criterion through the simple arrange- 
ment of having each judge rank all of the persons or things to be rated 
first, and of later making up partial lists of any desired length. The 
general purpose of the experiments was (1) using all four methods 
outlined above, to find how well the final order of merit derived from 
incomplete lists tallies with the order obtained from the same lists 
when each judge ranks each person or thing, or when the “true”
order is otherwise known; and (2) to compare the accuracy of the 
methods of combining incomplete judgment lists. 

In the first experiment, 15 samples of handwriting were used as 
material. These specimens had all been carefully graded by the 
Thorndike Handwriting Scale and were supplied, together with their 
scale values, by Prof. Thorndike. As far as this can be determined by the
scale, their true order of merit for beauty and legibility, therefore, was 
known. To each of 13 graduate students who acted as judges, a 
different number of samples selected at random from the list of 15 was 
given for arrangement in an order of merit series. Thus the first 
judge ranked 14 samples, the second 13, and so on to the last judge 
who ranked only 2. The criteria were beauty and legibility. As 
soon as the samples in the partial lists had been ranked, each judge was 
given the entire 15 samples, carefully shuffled, and asked to rank the
whole set. In this way two lists were obtained from each judge—one 
complete order of merit of the 15 samples, and one partial order of 
from 2 to 14 samples. 

The 13 complete orders of merit were combined to give a final order 
by averaging the positions assigned to each sample and arranging these 
averages in order of size. Thus two orders, the average complete, and 
the scale order were available as standards for comparison with the 
final orders as obtained by the four methods of combining partial lists. 

Table I gives for the 15 samples the orders of merit obtained from 
the scale values, from averaging the complete lists, and from combining 
the incomplete lists by each of the four methods under trial. (For 
brevity of description, these four methods will be known hereafter 
as the Averaging Method, the SD Method, the Comparison Method, 
and the Percentile Method.) Each of the final orders of merit as found
from the incomplete lists is correlated with the scale order by means of 
the Squared-difference Formula; and the average deviation of the 
position of any sample as found from the incomplete lists, from the 
corresponding position in the scale order has been calculated. In 
view of the very high correlation in every case there is little to choose 
as among the four methods; though the SD Method is slightly less 
accurate than the other three. The important thing is that the 
relatively simple methods give as good results as the more elaborate, 
and with far less labor of computation. It is also worth noting that 
the order obtained from averaging the complete lists is no more accu- 
rate, as judged by the scale order, than the orders got from the incom- 
plete lists. 
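
The two measures used in Table I can be sketched as follows. The sketch assumes that the "Squared-difference Formula" is the familiar rank-difference coefficient, 1 - 6Σd²/(n(n² - 1)). The function names are hypothetical, and the two rank columns are as read from Table I (scale order and Average Incomplete order, samples 1 to 15).

```python
def squared_difference_r(order_a, order_b):
    """Rank correlation from squared rank differences: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(order_a)
    if n != len(order_b) or n < 2:
        raise ValueError("orders must have equal length of at least 2")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(order_a, order_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


def average_deviation(order_a, order_b):
    """Mean absolute displacement of a position between two orders."""
    return sum(abs(a - b) for a, b in zip(order_a, order_b)) / len(order_a)


# Rank columns for samples 1-15, as read from Table I.
scale_order = [9, 14.5, 5, 12, 14.5, 7, 1, 6, 3, 2, 10, 8, 11, 13, 4]
average_incomplete = [11, 13, 6, 12, 15, 8, 2, 5, 3, 1, 9, 7, 10, 14, 4]

r = squared_difference_r(scale_order, average_incomplete)
ad = average_deviation(scale_order, average_incomplete)
```

For these two columns the coefficient rounds to .97, in agreement with the first correlation listed under Table I. The formula makes no correction for the tie at 14.5.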

As a further test of the four methods, a final order for the 15 samples 
was worked out in which various combinations of partial lists were 
selected. First, those lists containing 2, 3, 4, 5, and 6 samples were 
combined to give a final order of merit for all 15 samples; then the lists 
from 2 to 7 inclusive, 2 to 8, 2 to 9, 2 to 10, 2 to 11, 2 to 12, 2 to 13, all 
inclusive, were combined by each of the four methods. As no rank 
was given specimen No. 5 in any of the lists before 11, only 14 
samples could be given a final rank in combinations of lists which did 
not contain judgments on at least 11 specimens. However, as the 
rank of No. 5 in the scale order was 14.5 (tied with No. 2) it seemed 
fair to rank No. 2 as 14 in the scale order and drop out No. 5 until a 
rank for it could be obtained. This left the scale order intact from 1 
to 14, for a comparison with the final orders got from combining the 
TABLE I.—SHOWING THE ORDERS OF MERIT OBTAINED FROM THE THORNDIKE
SCALE VALUES FOR 15 SAMPLES OF HANDWRITING; ALSO THE ORDERS OF
MERIT FOR THE SAME MATERIAL OBTAINED FROM AVERAGING THE
COMPLETE JUDGMENT LISTS, AND FROM COMBINING THE INCOMPLETE
RATINGS BY FOUR METHODS. THE CORRELATION OF EACH
ORDER WITH THE SCALE ORDER IS GIVEN. ALSO THE
CORRELATION OF EACH ORDER WITH THE AVERAGE
COMPLETE ORDER

A

         Scale    Scale      Average    Average
Sample   value    O. of M.   complete   incomplete   Comparison   SD         Percentile
                             O. of M.   O. of M.     O. of M.     O. of M.   O. of M.

  1      11.5      9         10         11           10           11         11
  2       8.8     14.5       12         13           13.5         13         12
  3      13.5      5          6          6            7            6          6
  4       9.1     12         13         12           12           14         14
  5       8.8     14.5       15         15           15           15         15
  6      12.5      7          7          8            6            8          7
  7      14.9      1          3          2            2            3          3
  8      12.9      6          5          5            5            4          5
  9      14.2      3          2          3            3            2          2
 10      14.5      2          1          1            1            1          1
 11      10.9     10          8          9            9            9          9
 12      11.8      8          9          7            8            7          8
 13      10.5     11         11         10           11           10         10
 14       9.0     13         14         14           13.5         12         13
 15      13.7      4          4          4            4            5          4

r (Scale order and Average Incomplete order) ................ .97
r (Scale order and SD order) ................................ .95
r (Scale order and Comparison order) ........................ .98
r (Scale order and Percentile order) ........................ .96
r (Average Complete order and Average Incomplete) ........... .98
r (Average Complete order and SD order) ..................... .97
r (Average Complete order and Comparison) ................... .98
r (Average Complete order and Percentile) ................... .99
r (Average Complete order and Scale order) .................. .96

B

The AD of any position in the various orders of merit from the Scale order.

AD of any position in the Average Incomplete from the Scale order
AD of any position in the SD order from the Scale order
AD of any position in the Comparison order from the Scale order
AD of any position in the Percentile order from the Scale order
AD of any position in the Average Complete from the Scale order

TABLE II.—THE COMPARISON OF THE FOUR METHODS OF COMBINING INCOMPLETE
LISTS, WHEN AN INCREASING NUMBER OF PARTIAL LISTS ARE
COMBINED. THE r's ARE ALL WITH THE SCALE ORDERS

                                    Sums from the beginning
Method                  (2-6)   (2-7)   (2-8)   (2-9)   (2-10)   (2-11)   (2-12)   (2-13)

Average Incomplete r     [?]     .77     .78     .79     .81      .93      .95      .94
SD r                     .70     .72     .68     .75     .81      .89      .92      .97
Comparison r              ¹       ¹       ¹      .78     .80      .88      .92      .98
Percentile r             .65     .70     .69     .76     .82      .82      .92      .92

Explanation: (2-6), (2-7) etc., means that the incomplete lists containing
2, 3, 4, 5, 6, and 2, 3, 4, 5, 6, 7 items, respectively, were combined to give the
order which is compared with the scale order.

1 The data here were too scanty for making a fair test of the Comparison
Method.

incomplete lists. The correlations of the final orders from the four 
methods with the scale order are given in Table II. In the orders, 2 
to 6, 2 to 7, 2 to 8, the data were too scanty for making a fair test of the 
Comparison Method, and consequently the r is here left out. It is
apparent from the figures given that no one method is markedly 
superior; the simple methods continue to hold their own with the 
Comparison and SD Methods, and in several instances the correlation 
for the former is higher than for the latter. 

In order to check the results obtained in the above experiment, a 
second experiment was performed in which the same procedure was 
employed. Twelve samples of printing were used as material. The 
printing was done in radically different type, and consisted of two lines 
from the Declaration of Independence. With legibility and beauty 
as criteria, each of 10 judges was asked to rank, first, all 12 specimens 
in order of merit; and, when this was completed, certain specimens 
chosen at random from the entire list and numbering from 2 to 11. 
In this manner 10 complete lists and 10 incomplete lists were secured. 
Three additional persons were asked to rank the 12 samples, so that 
the final average order from the complete lists is based on the rankings 
of 13 judges. Table III gives the results of the correlation of this 
final complete order with the orders obtained from the four methods 
of combining incomplete lists. The Average Method is slightly
more accurate than the other three methods, which are all equally
accurate. All of the correlations are high, but not remarkably so for
an experiment under the conditions described. In its AD of any
position from the corresponding position in the Average Complete
order, the Average Method is again more accurate than the other methods.


TABLE III.—SHOWING THE FINAL ORDER OF MERIT OBTAINED BY AVERAGING
THE COMPLETE ORDERS OF 13 JUDGES ON 12 SAMPLES OF PRINTING DONE
IN DIFFERENT TYPE. ORDERS OF MERIT FOR THE SAME MATERIAL
OBTAINED BY COMBINING INCOMPLETE JUDGMENT LISTS ARE ALSO
GIVEN. ALL CORRELATIONS ARE BETWEEN THE ORDERS BY
THE FOUR METHODS AND THE ORDER FROM COMPLETE LISTS

A

          Average    Average
Sample    complete   incomplete   Comparison   Percentile   SD
No.       O. of M.   O. of M.     O. of M.     O. of M.     O. of M.

  1        3          3            3            5            5
  2        2          2            1            1            1
  3       12         12           12           12           12
  4        1          1            2            2            2
  5       10          7            8            8            8
  6        4          4            4            3            3
  7        6          8            5            7            7
  8        8.5        9           10           10           11
  9        8.5       10           11           11           10
 10        7          6            7            4            4
 11        5          5            6            6            6
 12       11         11            9            9            9

r (Average Complete order and Average Incomplete order) .... .94
r (Average Complete order and Comparison order) ............ .92
r (Average Complete order and SD order) .................... .88
r (Average Complete order and Percentile order) ............ .88

B

AD of any position in the various orders of merit from the Average Complete
order

AD of any position in the Average Incomplete order from the Average
    Complete ...........................................................  .67
AD of any position in the Comparison order from the Average Complete ... 1.00
AD of any position in the SD order from the Average Complete ........... 1.50
AD of any position in the Percentile order from the Average Complete ... 1.50

TABLE IV.—COMPARISON OF THE FOUR METHODS OF COMBINING INCOMPLETE
LISTS WHEN AN INCREASING NUMBER OF PARTIAL JUDGMENT LISTS ARE
COMBINED. THE r's ARE WITH THE AVERAGE COMPLETE ORDER

                              Sums from the beginning
Method                  (2-6)   (2-7)   (2-8)   (2-9)   (2-10)   (2-11)

Average Incomplete r     .20     .60     .77     .84     .91      .94
SD r                     .32     .64     .64     .85     .89      .88
Comparison r              ¹       ¹       ¹      .80     .93      .91
Percentile r             [?]     .52     .62     .81     .92      .88

Explanation: (2-6), (2-7) etc., means that the incomplete lists containing
2, 3, 4, 5, 6 and 2, 3, 4, 5, 6, 7 items, respectively, were combined to give the order
which is compared with the Average Complete order.

1 The data here were too scanty to make a fair test of the Comparison Method.

In this experiment, as in the first, the incomplete lists were next 
combined in order, first the lists containing 2 to 6 items, inclusive, 
and then those lists containing 2 to 7, 2 to 8, 2 to 9, 2 to 10. All four of
the methods of combining incomplete lists were used, and the final 
order for the 12 samples as found by each method, was compared with 
the order found by averaging the complete lists. It was impossible 
to use the Comparison Method on the data from lists 2 to 6, 2 to 7, 
and 2 to 8, and even with lists 2 to 9, and 2 to 10, the final order was 
often highly uncertain. This was due to the fewness of judgments, 
and to the contradictions in position (also mentioned as a difficulty 
by Ream) which arose when a neighbor to neighbor comparison was 
attempted. Table IV gives the results. It is clear that except in the 
shorter list combinations, where the order as given by the incomplete 
methods is far afield from the final order as found from the complete 
lists, the four methods give about the same degree of accuracy. It is 
evident, also, that the final orders of merit which result from combining 
short lists and scrappy data, such as that given in lists 2 to 6, 2 to 7, 
and 2 to 8, are very unreliable when taken as approximations to the 
true order (taking as the ‘‘true’’ order that order which is obtained 
from a reasonably large number of complete lists). 

The two experiments, which have been described, make use of 
material on which there would certainly be fairly close agreement 
among persons of approximately the same age and training. In
general there is pretty close agreement as to what constitutes good
handwriting, and “good,” e.g., legible, print. In ranking individuals
for character traits (such as honesty, tact, or persistence) or for general
intelligence, there would most certainly be less agreement due to
varying degrees of acquaintance, different standards for intelligence
or honesty, prejudice, and other subjective (and hence variable)
factors. Even the final order of merit from complete lists would very
probably fall short of the hypothetical “true” order; so that the
criterion may actually be less accurate than the final order got from
combining partial judgment lists. It is, however, the only criterion
available in terms of which our four methods can be evaluated and, other
things being equal, should be nearer the “true” order than any of the
final orders derived from incomplete lists.

In the third experiment, the four methods are compared for judg- 
ments made on a distinctly debatable possession, namely, that of 
eminence. To each of 16 graduate students was submitted a list 
containing the names of forty psychologists. The instructions were as 
follows: “Rank the following in order of merit, putting the one whom
you think the most eminent No. 1, the next No. 2, and so on through
the list.” In this way 16 complete orders were obtained, each containing
40 names in order of merit. Each judge was then told to
eliminate a certain number of names (those psychologists with whom he 
was least well acquainted) and to arrange the remaining names in 
order of merit. One judge was told to eliminate 5 names, another 6, 
and so on to 20. While the rater decided what names to eliminate, 
the number of names was entirely arbitrary, and was assigned by the 
experimenter. 

From the 16 complete lists a final order was made up. As a matter
of interest, in addition to the usual method of averaging the ranks
assigned by the individual raters for the final order, the SD method,
used heretofore only with incomplete data, was employed. In time
required the SD method took nearly four times as long as the Average, 
not counting the time spent in making up the tables for determining 
the SD values. The 16 partial lists were then combined by the four 
methods, and the resulting orders correlated with the order of merit 
from the complete lists. The results given in Table V are completely 
in line with those of the other two experiments. The orders obtained 
by the Average Method and the Percentile Method are fully as accu- 
rate (as judged by the criterion) as those got by the more refined 
TABLE V.—MATERIAL: A LIST CONTAINING THE NAMES OF 40 PSYCHOLOGISTS
TO BE RANKED IN ORDER OF MERIT FOR EMINENCE. THE TABLE
GIVES THE ORDER OF MERIT OBTAINED FROM AVERAGING THE
COMPLETE JUDGMENT LISTS OF 16 JUDGES; ALSO THE ORDERS
OBTAINED FROM VARIOUS NUMBERS OF INCOMPLETE LISTS BY
FOUR METHODS. ALL CORRELATIONS ARE WITH THE
AVERAGE COMPLETE ORDER. THE SD METHOD WAS
ALSO USED ON THE COMPLETE LISTS (SEE
LAST COLUMN)

A

             Average    Average                                                SD
Individual   complete   incomplete   [illegible]   [illegible]   [illegible]   (complete)
No.          O. of M.   O. of M.     O. of M.      O. of M.      O. of M.      O. of M.

  1           7          8            8             8             9             8
  2           5          5            6             4             7             5
  3          27.5       24           22            24            21            28
  4           2          2            2             2             2             2
  5           6          7            7             9             3             7
  6           9         11           11            11            13             9
  7           3          3            3             3             4             3
  8          17         17.5         23            23            19            17
  9           4          4            4             6             5             4
 10          21         19           18.5          20            22            20
 11          13         27           21            22            23            14
 12          19         17.5         15            14            20            19
 13          33         35           34            30            32            32
 14          31         26           20            21            12            33
 15           1          1            1             1             1             1
 16          23         21           25            26            27            24
 17          39         39           40            40            38            40
 18          11          9.5         10            10            11            10
 19          22         20           24            25            17            21
 20          10          9.5          9             7             8            12
 21           8          6            5             5             6             6
 22          24         28           29            29            26            26
 23          12         12           12            12            10            11
 24          25.5       15           17            13            18            25
 25          29         23           18.5          18            30            30
 26          25.5       16           16            17            15            23
 27          18         22           27            19            24            22
 28          30         32           33            33            28            29
 29          16         13           13            15            14            16
 30          40         38           38            39            40            39
 31          34         36           35            35            35            34
 32          37         33           37            37            36            37
 33          15         25           26            27            25            15
 34          38         30.5         39            38            37            38
 35          32         34           31            34            33            31
 36          14         14           14            16            16            13
 37          20         29           30            32            29            18
 38          36         40           28            31            39            36
 39          35         37           36            36            34            35
 40          27.5       30.5         32            28            31            27

methods. The writer, using a Monroe Calculating Machine, made
up the final order from the incomplete lists by the Average Method in 
30 minutes; the Percentile Method took a little more than twice as 
long; the SD Method, with the tables of SD values made, required 
nearly three hours; and the Comparison Method took the whole of 
an afternoon. 

Judged from the standpoint of both accuracy and time required, 
it has been shown in every case that the simple methods are better than 
those which make use of a more elaborate technique. This result 
was rather surprising, in view of the fact that the Average Method, as 
was said before, takes no account of the different meaning which the 
same rating may have in series of different length. The reason why 
the Average Method is as accurate as the SD or the Comparison must 
be due to the fact that the final order depends on the averages of all 
the ratings given the various individuals or things. Therefore, an 
individual who is uniformly ranked high will have a small average, 
and a high position in the final order of merit no matter whether his 
ratings were in lists of 5, 10, or 15. The judge who ranks a few cases 
simply has less weight in determining the size of the average, and hence 
the final position, than the judge who ranks many cases; and this result
is probably the desirable one. If A is ranked No. 1 and Z No. 5 in a
list of 5, i.e., first and last, and A is ranked No. 1 and Z No. 10 in a
list of 10, the gap between A and Z in the first instance is 4, and in the
second instance it is 9. Persons ranked “high” and “low” are closer
together in a short list than in a long one. The final order as given by
the Average Method is, therefore, as close to the true order as that given
by the methods using “weights,” since A, if ranked high in all the lists
will have a small average and a high rank in the final order of merit, no 
matter how his ratings are weighted; while Z if ranked low in all of 
the lists will have a large average and a low rank in the final order. 
Whether this seems to be reasonable or not, the fact remains that
the Average Method gives as accurate results as those methods, the
SD and the Percentile, in which the ratings are “weighted” in
accordance with the length of the series.

r (Average Complete order and Average Incomplete order) ........ .922
r (Average Complete order and SD order) ........................ .910
r (Average Complete order and Comparison order) ................ .906
r (Average Complete order and Percentile order) ................ .903
r (Average Complete and SD complete order) ..................... .995

B

AD of any position in the Average Incomplete from the Average Complete order .... 3.05
AD of any position in the SD order from the Average Complete order .............. 3.46
AD of any position in the Comparison from the Average Complete order ............ 3.67
AD of any position in the Percentile from the Average Complete order ............ 3.35
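
The Average Method described above can be sketched in a few lines. This is a minimal illustration, not the writer's own procedure: the function name `average_method` and the two sample lists are hypothetical, and individuals missing from a list are simply left out of that judge's contribution.

```python
from collections import defaultdict


def average_method(lists):
    """Combine incomplete rank lists by averaging each individual's ranks.

    Each list is ordered from best (rank 1) to worst. An individual who is
    absent from a list contributes nothing for that judge.
    """
    ranks = defaultdict(list)
    for ranking in lists:
        for position, name in enumerate(ranking, start=1):
            ranks[name].append(position)
    averages = {name: sum(r) / len(r) for name, r in ranks.items()}
    # A smaller average rank means a higher place in the final order.
    order = sorted(averages, key=lambda name: averages[name])
    return averages, order


# Hypothetical data: A is first and Z is last in a list of 5 and a list of 10.
short_list = ["A", "p", "q", "r", "Z"]
long_list = ["A", "b", "c", "d", "e", "f", "g", "h", "i", "Z"]

averages, order = average_method([short_list, long_list])
gap_short = short_list.index("Z") - short_list.index("A")  # gap of 4
gap_long = long_list.index("Z") - long_list.index("A")     # gap of 9
```

Here A averages 1 whatever the length of the lists, while Z's ratings of 5 and 10 average 7.5, so the longer list pulls Z's average further than the shorter one does.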

In Table VI some of the facts under discussion are brought out 
more clearly. The illustration is the one given by M. J. Ream in his 
article, already referred to. As an example of the SD Method, Ream 
combined the partial rating lists of 10 sales managers, who ranked 
from 3 to 10 salesmen out of a total of 13. The final order as given by 
the Average Method correlates with the SD order as given by Ream 
.976. The order as found by the Percentile Method correlates 1.00 
with the SD order. There is little to choose as among the three 
methods; in point of time required, the order would be Average,
Percentile, and SD. If the lists containing the ratings of 3 and 4
individuals are dropped out, the order is little affected; it correlates .976 with
the order which includes all the lists. If the lists containing the ratings 
of 9 and 10 individuals are dropped out, the order now correlates .94 
with the order which contains all the lists. If the three shortest lists,
i.e., those of 3, 4, and 5, are dropped out, a combination of the lists that remain
(using the Average Method) gives an order of merit that correlates
.98 with the order found from all the lists. If the three longest lists,
i.e., those of 8, 9, and 10, are dropped out, the order from the lists that remain
correlates .92 with the final order found from all the lists. This would 
seem to substantiate the statement made above that the judge who 
ranks few cases has less weight than the judge who ranks many cases; 
his ratings are smaller numerically, they affect fewer cases, and have 
less effect on the averages from which the final order is determined. 
In spite of the fact that it would seem logically necessary to weight each
rating according to the length of the list in which it occurs, in practice
this is not so. Thus if an individual is given a rank of No. 1 in a
list of 10, his percentile rank would be 5, and his SD rank would be 
1.76; the individual ranked last in the same list would get a percentile 
rank of 95, and an SD rank of —1.76. The two persons have the 
highest and lowest ranks possible by all three methods, and the 
averages, which affect the final order of merit would be increased or 
diminished in exactly the same way. Of course the gaps between the 
averages would vary with the weighting system, but these gaps are not 
necessarily, nor even often, equal as the final order of merit is in no 
sense a scale. 
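
The percentile values quoted above (5 for first place and 95 for last place in a list of 10) agree with taking the midpoint of each rank's interval. The sketch below assumes that midpoint formula; the function name `percentile_rating` is hypothetical, and the formula is inferred from the quoted values rather than stated in the text.

```python
def percentile_rating(rank, list_length):
    """Percentile position of a rank, taken at the midpoint of its interval.

    Assumed formula: 100 * (rank - 0.5) / list_length.
    """
    return 100.0 * (rank - 0.5) / list_length


first_of_ten = percentile_rating(1, 10)   # first place in a list of 10
last_of_ten = percentile_rating(10, 10)   # last place in a list of 10
```

Under this assumption the first and last places keep the extreme values in any list, which is why weighting by length leaves the extremes of the final order where the plain ranks put them.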

TABLE VI.—A COMPARISON OF THE SD, PERCENTILE, AND AVERAGE METHODS
FOR THE SAME DATA. ALSO SHOWING THE EFFECT OF PARTIAL LISTS OF
DIFFERENT LENGTH ON THE FINAL ORDER OF MERIT WHEN THE INCOMPLETE
LISTS ARE COMBINED. ILLUSTRATION FROM M. J. REAM, Journal Applied
Psychology, 1921, 266. THE SD ORDER IS TAKEN FROM REAM, AND IS NOT
WORKED OUT IN THE TABLE. PERCENTILE VALUES APPEAR ABOVE THE RANKS

[The individual ranks and percentile values given by the 10 sales
managers are not legible in this copy. The numbers of salesmen judged
in the 10 lists were 8, 6, 3, 10, 5, 7, 8, 4, 7, and 9.]

Salesmen                      A     B     C     D     E     F     G     H     I     J     K     L     M
Sums: Ranks                  32     …     …     …     …     …    29    19    49    13    33    14    19
Sums: Percentiles           355   148   109   324   125   102   338   214   649   154   384   147   317
Averages: Rank              6.4   1.5   2.8   5.8     4   1.3   7.3   3.8     …   3.3   6.6   2.3   3.8
Averages: Percentile         71    19    27    65    42    17    85    43    93    38    77    25    63
O. of M., Average Method     10     …     …     …     …     1    13   6.5    12     …     …     3   6.5
O. of M. (SD) from Ream      10     …     …     …     …     1    12     7     …     …     …     3     8
O. of M., Percentile Method  10     …     …     …     …     1    12     7     …     5    11     3     8

r (SD Method and Average Method) ........ .976
r (SD Method and Percentile Method) ..... 1.000

Correlations of various combinations of incomplete lists with order above by Average Method

O. of M. (Average Method)     A     B     C     D     E     F     G     H     I     J     K     L     M      r
Lists 3 and 4 out            10     2     4     7     9     1    12     6    13     5    11     3     8   .976
Lists 9 and 10 out           13     2     4     9   6.5     1  10.5     8  10.5     5    12     3   6.5   .94
Lists 3, 4 and 5 out         10     2     4     9   6.5     1    12   6.5    13     5    11     3     8   .98
Lists 8, 9 and 10 out        13     3     2     9   6.5     1  11.5     8    10     5  11.5     4   6.5   .92

The question of the variability of the final position of a given
individual or thing is interesting, though not particularly relevant to
the present study. Suppose an individual A is ranked first in each of
four lists of 5, 10, 8, and 6. His ratings by the Average Method would
all be No. 1; by the Percentile Method they would be 10, 5, 6, and 8; by
the SD Method they would be 140, 176, 167, and 149 (dropping decimals).
A’s variability by the Average Method would be 0, and since he is
given first place in each list this would seem to be fair. However,
from a different point of view a rank of No. 1 in a list of 10 is actually
a higher rating than a rank of No. 1 in a list of 5; and using the
percentile ratings we may (for the problem above) give A a percentile
average of 7.25 with an AD of 1.75, or using the SD ratings an average
of 158 with an AD of 13.5. Such measures of variability would be
meaningful in comparison with the variability of the other individuals
rated. In most cases the worker with partial judgment lists is content
if he can combine his incomplete lists into an approximation of the real
order of merit; and the question of the variability of the final positions
is not of any great importance. If some measure of agreement is
desired, however, the Percentile Method would seem to offer the
simplest way of finding it.
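
The averages and average deviations quoted for individual A can be checked directly from the ratings given in the paragraph. The helper name `mean_and_ad` is hypothetical; the ratings themselves are the ones quoted above.

```python
def mean_and_ad(values):
    """Return the mean and the average (mean absolute) deviation from it."""
    mean = sum(values) / len(values)
    ad = sum(abs(v - mean) for v in values) / len(values)
    return mean, ad


# A is ranked first in lists of 5, 10, 8, and 6.
rank_ratings = [1, 1, 1, 1]
percentile_ratings = [10, 5, 6, 8]
sd_ratings = [140, 176, 167, 149]   # decimals dropped, as in the text

rank_summary = mean_and_ad(rank_ratings)
percentile_summary = mean_and_ad(percentile_ratings)
sd_summary = mean_and_ad(sd_ratings)
```

The three results are (1, 0) for the ranks, (7.25, 1.75) for the percentile ratings, and (158, 13.5) for the SD ratings, in agreement with the figures in the text.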

A test of the Average Method and the SD Method when only 
complete lists are used was made in Table V. Here the 16 complete 
lists of the 40 psychologists were combined by averaging the numerical 
ratings on each individual, and by averaging the SD ratings. The 
correlation of the two orders of merit was .995. Except in cases when 
two or more individuals are tied for a given place we should expect 
perfect or nearly perfect correlation since, to repeat what has been said 
above, the final order depends on the size of the averages. 

One other test of the Average Method was made using the data 
given by Thorndike in his illustration of the Comparison Method. As 
mentioned before, Thorndike illustrates his method by combining the 
judgments of 18 individuals on the general intelligence of 34 freshmen. 
The judges ranked only those men whom they knew, their lists con- 
taining from 14 to 29 names. If these incomplete lists are combined 
by the Average Method, the resulting order of merit correlates .95 
with the order given by Thorndike using the Comparison Method. 
Since neither order is more than approximately correct, in view of the 
high correlation between them and the far more laborious computation 
which the Comparison Method necessitates, the Average Method 
would certainly seem to be preferable. The writer is doubtful whether 
the far more refined technique is worth the trouble. 

A summary of the facts which this study seems to have brought
forth is given below.
1. The final order of merit obtained from incomplete judgment lists
by the SD, Average, Percentile, or Comparison Methods tallies very
closely with the best “standard” or “true” order. The more
numerous the partial lists, and the longer the lists (other things being equal)
the closer the agreement of the final order from the partial lists and the 
final order from complete lists. 

2. The order of merit obtained from scattered and sparse data may 
be very inaccurate when judged by the order obtained from complete 
lists. With such data no method gives accurate results. 

3. Judged from the standpoints of simplicity and time required, 
either the Average Method or the Percentile Method is superior to 
the other methods. As far as accuracy is concerned, the Average 
Method is certainly as good as the other three, if not better. 

4. When the variability of the individuals or the things rated is 
desired, either the SD or the Percentile Methods may be used to 
advantage. Of the two, the Percentile Method is the more simple as 
it avoids the use of negative quantities. 





HERRING REVISION OF THE BINET-SIMON TESTS¹
JOHN P. HERRING 


Director, Bureau of Research, New Jersey Department of Institutions and Agencies 


The purposes evolved in the construction of the Herring Revision 
of the Binet-Simon Tests are these: 

1. To build a new series of individual intelligence tests which 
assume our present criteria and correlate with them as highly as possible. 

2. To study the reliability of the Stanford-Binet examination and 
of such new series as are made from time to time. 

3. To discover whether the best material for the content of such 
examinations has already been used, or whether there is rather an 
unlimited amount of it in the universe of human behavior. 

4. To make progress in the construction of a number of individual 
intelligence examinations sufficient for the needs of psychologists and 
of educational institutions generally. 

5. To extend the upper and lower levels of these examinations 
so that human beings of all ages may be measured. 

6. To mechanize the whole procedure of individual measurement
so far that any teacher with at least average intelligence and zeal
can measure the mentality of her pupils with a PE of estimate as
low as four mental months during the first few examinations made and
thereafter as low as three mental months, and that accomplished
examiners can perhaps bring this as low as two mental months.

7. To publish an instrument so simple and precise that the 
Utopian dream of measuring all the children of the United States may 
come true through the cooperation of rural and urban teachers.²

8. To make rules for the safe and systematic omission of portions 
of examinations falling too far below and above individual examinees, 
so that intensive testing on each level may be possible. 

9. To standardize to some point of diminishing returns every 
series of responses needing it, and to give preference to tests for which 
the keys are mechanically used. 

10. To make short forms of several of the examinations by omitting
from the procedure, other things equal, tests that take much time, that
are in any way irksome or awkward to administer, that require
apparatus, that show inferior correlations, and that are less reliably
handled by inexperienced than by experienced examiners.

11. To arrange and organize the tests of an examination so as
to permit of the independent administration and standardization of
several limited portions, each of different length, and each having
norms corresponding to mental age equivalents.

12. To publish individual examinations which may be administered
and scored in an amount of time approaching in economy that of
group examinations; and perhaps even to construct individual
examinations which, if of equal administration time per child, shall be more
reliable than the best group examinations; and if of reliability equal
with the best group examinations, shall require shorter administration
time per child. (Administration time is here used to include giving,
scoring, and tabulating.) If this can be done, it may be possible
also to make individual examinations of which the reliability is higher
and the administration time less than those of group examinations.

13. To stimulate the use of individual examinations as against
group examinations whenever the measurement of individuals is
primarily concerned, and whenever it is desired to know accurately
the SD of the mental ages or IQ’s of a group.

14. To be able in the case of intelligence to replace the less
satisfactory device of correction for chance errors with the more adequate
policy of obtaining precise original measures.

15. To make two complete individual intelligence examinations
embodying as many as possible of the foregoing suggestions, and in
addition, composed only of exercises requiring the minimum
standardization of responses, so that both administration and scoring are
objective, mechanical, and devoid of difficulties, for the purpose of
obtaining a reliability coefficient lying as much above 0.99 as possible.

It is evident that these purposes involve much that is not yet
accomplished.

1 “The Herring Revision of the Binet-Simon Tests, and Verbal and Abstract
Elements in Intelligence Examinations.” World Book Company, to appear.

2 On the feasibility of our public school teachers doing such work, see Terman,
“The Intelligence of School Children,” Chapter XIII.


USES OF THE REVISION 


Among the uses to which the Herring-Binet may be put are the 
following: 

1. The determination of mental ages and intelligence quotients. 

2. The verification of mental ages and intelligence quotients 
already obtained by means of other individual examinations. 
3. The re-determination of individual mental ages obtained by 
group examinations. 

4. The very accurate determination of mental ages and intelli- 
gence quotients, through averaging the results of the Herring and 
of the Stanford. 

5. The study of the reliability of individual interview technique 
as compared with that of group technique in the measurement of 
intelligence; particularly of the Stanford and of the Herring. 


6. The study of the correlations, including the reliability and the 
validity of each component test. 


HISTORY OF THE REVISION


During the fall of 1920 a number of sources were reviewed for 
material for new individual tests. The Stanford itself furnished a 
large portion of the suggestion; indeed the attempt was made to 
imitate the form of as many of the tests as it seemed feasible, and 
at the same time to use new content. A study was also made of many 
educational and mental examinations, both individual and group, 
including the National Intelligence Tests, the Kelley-Trabue Com- 
pletion Tests, the Thorndike Entrance Examinations, and many others. 
The material used comes from many and varied sources, and has been 
freely modified. 

At the same time, that which has grown into the present series 
was assembled, tried with a few subjects, revised and re-tried 
repeatedly until it appeared ready for such extended use as is 
associated with publication. 

The subjects examined at first as a basis for evaluation and stand- 
ardization were 154 in number; they ranged from age 4 to age 18, 
from mental age 53 months to mental age 216 months, and from IQ 
0.42 to IQ 1.60. They are children of the Garden City public school, 
of the Letchworth Village Institution for the feeble-minded, and of 
a private school in Scarboro, New York. The one who made the 
tests administered less than one-half of the examinations, and obtained 
about one-eighth of the Stanford IQ’s. Those who determined the
other IQ’s in both examinations were unusually competent investiga- 
tors, and candidates for the doctorate in Educational Psychology at 
Teachers College. Four thousand to five thousand additional sub- 
jects have been examined in Pennsylvania, most of whom have not 
yet had a Stanford Examination. 
RELIABILITY


One hundred and twenty-six children ranging in age from 8 to 13 
years inclusive and correlated separately in age groups yield an average 
Pearson Product Moment Correlation between the Stanford-Binet 
and Herring-Binet of 0.981 for mental ages and 0.982 for intelligence 
quotients. One hundred and fifty-four children, including the 126 
already mentioned and a scattering of younger and older children, 
who range all together from 4 to 18 years old, yield a correlation of 
0.987 for mental ages and 0.980 for IQ’s. The class intervals used in 
these correlations were 5 months and 5 points in IQ. These r’s were
obtained by using Group E of the Herring-Binet. An r of 0.985 was 
obtained from the same 154 cases by counting only the tests of Group 
C (Class interval 5). The correlation of Group A scores with the 
Herring-Binet Mental ages is 0.953, n being 154. 

The best available estimate of the reliability of the Herring
examination under conditions of careful use is: r = … and
$0.67449\,\sigma\sqrt{1-r^{2}}$ = 3.71 points in IQ, where n = 72; and
r = .9908 ± .001 and $0.67449\,\sigma\sqrt{1-r^{2}}$ = 2.37 points in IQ,
where n = 82. These may be compared with the
correlations of the previous paragraph. The 2 r’s of this paragraph are
superior to those of the preceding because they are obtained from 
ungrouped data, the class interval being one point in IQ. The 
second and higher r (0.9908) and its lower probable error of estimate 
(2.37 IQ points) are to be preferred to the others because during the 
first 72 examinations the tests were themselves subjected to constant 
and free modification of structure, content, and administration, while 
during the following 82 cases they had settled down to comparative 
stability. 

There is sufficient difference in the reliability of Groups C, D, and 
E, to warrant the recommendation that at least D be used whenever 
individual welfare is at stake. E is better. In the original data, they
all correlate with the Stanford about 0.98 or 0.99 in unselected age 
groups. Group A correlates about 0.95 and Group B about 0.97. 
For the purpose of obtaining only fairly reliable mental ages, Group 
C is recommended; for rapid work, Group A. The difference between 
the correlations of 0.97 and 0.99 may to some appear trivial. Such, 
however, is not the case for a certain number of individuals, 0.99 
being really much preferable owing to the decreased number of indi- 
viduals who can be mismeasured by significant amounts. As long as 
we have instruments with a precision indicated by the correlation of 
0.99, we have little right to mismeasure individuals by the use of 
those less precise. The very best measure obtainable is probably that 
based upon both Stanford and Herring mental ages being averaged 
and the examinations being given on different days. 

Group A gives a measure of intelligence which, for reliability, if 
we may judge by the original data is at the least equal and perhaps 
superior to the mental ages usually found by means of group tests and 
distinctly inferior to those usually found by means of longer individual 
tests. Group A is administered in from 2 to 15 minutes, averaging 
6 minutes. 

For the purpose of obtaining the most precise measures possible,
Spearman’s formula

$$r_{nn} = \frac{nr}{1+(n-1)r}$$

gives an r, as between the Stanford plus the Herring and two other
similar tests, of 0.9954 with a PE of estimate (assuming a sigma of 25
points in IQ instead of 26 as obtained in r = 0.9908) of 1.61 points in IQ.

PEARSON PRODUCT MOMENT CORRELATIONS BETWEEN STANFORD-BINET AND
HERRING-BINET, ACCOMPANIED IN EACH CASE BY PROBABLE ERROR
OF ESTIMATE ($0.67449\,\sigma\sqrt{1-r^{2}}$)
The correlation and PE of estimate ($0.67449\,\sigma\sqrt{1-r^{2}}$) of the 72 cases were obtained
while the tests themselves were being constantly and freely modified;
those of the 82 cases after the tests had settled into almost final form.
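
The figures quoted in this section can be checked with a short calculation. The sketch below assumes that the stepped-up coefficient of 0.9954 comes from applying Spearman's formula with n = 2 to r = 0.9908, and that the probable error of estimate is 0.67449·σ·√(1 − r²) with σ in IQ points; the function names are hypothetical.

```python
import math


def spearman_brown(r, n):
    """Reliability of a measure lengthened n times (Spearman's formula)."""
    return n * r / (1 + (n - 1) * r)


def pe_of_estimate(r, sigma):
    """Probable error of estimate: 0.67449 * sigma * sqrt(1 - r^2)."""
    return 0.67449 * sigma * math.sqrt(1 - r * r)


r_single = 0.9908
r_doubled = round(spearman_brown(r_single, 2), 4)   # assumed n = 2

pe_single = pe_of_estimate(r_single, 26)     # sigma of 26 points in IQ
pe_doubled = pe_of_estimate(r_doubled, 25)   # sigma of 25 points in IQ
```

With these assumptions the doubled coefficient rounds to 0.9954, and the two probable errors come out close to the 2.37 and 1.61 points in IQ given in the text, the small difference in the second being a matter of rounding.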

The curvilinear correlation (Rugg: Statistical Methods, page 278) 
between the scores of Group A, composed of four tests requiring six 
minutes average administration time, and the Herring-Binet Mental 
Ages, in 154 cases is 0.953. The formula is as follows: 
VALIDITY


The Stanford-Binet examination has been assumed as a criterion 
of intelligence, notwithstanding Terman’s own insistence that new
criteria be constantly sought, and Thorndike’s suggestion that criteria
of executive and of mechanical ability be used in intelligence examina- 
tions. To study such new criteria is no less than a crying need, but 
this falls outside the present attainment of this study. The correla- 
tions with the Stanford are so high that whatever defects as to criteria 
obtain in one must to an almost equal degree obtain in the other. 


CONCLUSIONS 


1. Within the limitations of the study thus far, both the Stanford 
Revision of the Binet-Simon Tests and the new series of individual 
tests possess high reliability. 

2. By using both tests and averaging the mental ages resulting, 
we may obtain a measure of intelligence of very high reliability. In 
addition to the influence of greater amount of test material upon
reliability, we have the advantage of using a whole examination upon 
two different days. 

3. The new series, provided many investigators and a thousand 
cases yield results of the same character as those which three 
investigators and 154 cases have yielded, is a valid measure of the 
thing measured by the Stanford. To the degree to which certain
conclusions reached in an article in an earlier issue¹ and in an article by Gates²
are true, the objection, that the Stanford and the new series are com- 
posed too exclusively of verbal and non-concrete matter, is not valid, 
and for all that objection, we may still conclude that tests of this type 
measure intelligence. 

4. There was no great difficulty encountered in the search for 
material for the content of an individual intelligence examination. 
There is no reason to suppose that another series could not be made 
with the same degree of facility, so far as the collection of content is 
concerned. Indeed several times the bulk of the Herring has already 
been collected without special difficulty. This does not mean that 
it is easy to make examinations, but it suggests definitely that instead 
of having nothing left in the universe of individual test material but 
the skim-milk, there seems to be a practically limitless reservoir of 
whole milk. 





1 Herring, J. P.: Verbal and Abstract Elements in Intelligence Examinations. 
Journal of Educational Psychology, December, 1921, Vol. XII, No. 9, pp. 511-517. 

2 Gates, A. I.: Correlations of Achievement in School Subjects with Intelligence 
Tests and Other Variables. Journal of Educational Psychology, March, 1922, 
Vol. 13, No. 3, p. 137. 
OUTLOOK


It seems not entirely chimerical to hope that a reliability correlation 
between two twenty-five-minute examinations of intelligence may be 
found as high as about 0.995; the probable error of one such examina- 
tion as low as perhaps four weeks, and of two, perhaps three weeks. 
This hope is based upon an obtained Pearson product moment
correlation of 0.991 ± 0.001 between the Stanford and a thirty-five-minute
examination, plus the conviction that several devices, each tending 
to increase correlation, are open to further perfection. Among these 
devices are the following:

1. Those already mentioned under PURPOSES. 

2. Scoring every test element either 0 or 1, as against such scoring 
as 0 or 3, or 0 or 5, methods which occur both in the Stanford and in 
the Herring. 

3. More rigorous elimination of tests showing inferior correlations. 

4. A broader search for new types of tests yielding superior 
correlations. 

5. Arrangement to have all the examinations administered by 
the best examiner available, by the same examiner, with the same time 
interval between examinations, and with the most scrupulous preci- 
sion in detail. A reliability resulting from such pains is to be distin- 
guished from that produced by teachers working under average 
conditions. The former may perhaps yield 0.995; the latter, only 
0.980. This difference is significant, as those know who are familiar 
with the value of different portions of the correlation scale from 
0 to 1 as related to prediction. 

6. Dividing the total time of the examination into short units, 
one for each test element, eliminating test elements that take much 
time, so that there may be as many as two units for each month of 
mental age. 
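
The remark in item 5 about the unequal value of different portions of the correlation scale can be illustrated with the factor √(1 − r²), the same factor that enters the probable error of estimate used earlier in this article. The sketch below is an illustration only; the function name `error_factor` is hypothetical.

```python
import math


def error_factor(r):
    """Fraction of the original spread left in the errors of estimate."""
    return math.sqrt(1 - r * r)


careful = error_factor(0.995)   # examinations given with scrupulous precision
average = error_factor(0.980)   # teachers working under average conditions
ratio = average / careful
```

On this reckoning a rise from 0.980 to 0.995 cuts the error of estimate roughly in half, although the two coefficients differ by only 0.015.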

Educational as well as mental examinations, it is possible, may 
yield by means of similar methods equally high reliabilities. The 
Stanford Achievement Tests confirm this surmise. 

Should such hopes be fulfilled, then it would be possible to talk 
of precision as we do in Physics. How exact is the measurement 
of a line as compared with the measurement of ability? Such
comparisons can be made if purposes be admitted as criteria. We measure
lines as closely as many purposes demand. So we can measure
intelligence also. Lest this receive a naive interpretation, the obvious
must again be said: intelligence here means that which is measured.
For what purpose can we at present utilize a measure closer than one 
which is half the time only seven mental weeks different from the 
truth? For pure psychology, perhaps for some; for placement of 
individuals, perhaps for few. 

That there is enough left to do in the further perfection of instru- 
ments will be readily admitted: inventing a unit of measurement 
more demonstrably equal in some sense for all levels of intelligence 
(perhaps McCall’s T-Scale); introducing new criteria of intelligence; 
extension upward and downward; multiplication of scales until there
are enough; multifarious organization of examinations to secure a
variety of secondary purposes; production of an indefinite number of
homogeneous (correlational criterion) trait scales, the composite of 
which measures intelligence. 

Instruments exist by which every elementary teacher of average 
intelligence can ascertain the mental ages and IQ’s of her pupils, 
provided she have interest and faithfulness. It must not be admitted 
that this generation of children may pass through school unmeasured. 
Teachers can measure them, and owing to the ratios of the number 
of trained psychological examiners and of teachers to the number of 
pupils in the United States, the teachers alone can. This seemingly 
Herculean task, if organized as our government organized war, could 
be accomplished with substantial accuracy in six months. A national 
university, clothed with adequate authority could do this. 







ONE INDUSTRY’S ATTITUDE TOWARD SELECTION 
BY MENTAL LEVELS 


C. MARCUS WIENAND 


Communicated by Frederick E. Bolton, University of Washington 


A large corporation, finding it necessary to assign certain of its 
workmen for special training to handle new equipment being installed 
by that company, made an interesting investigation as to various 
methods of selecting these men. Since intelligence tests were investi- 
gated and considered not feasible, the observation of a company 
official on the use of mental testing will be of interest to all who are 
concerned with the use of intelligence tests. First, however, permit 
me to describe the situation calling for the selection of the men. 

This corporation operates throughout the United States and would 
be well known to all were I at liberty to divulge its name. The problem
arose in a certain Western city where large replacements of equipment 
became necessary due to the rapid growth of that city. The company 
decided to install the very latest type of equipment developed in its 
engineering laboratories. However, this new type was so radically 
different from any then in use that none of the workmen employed 
by the company, outside of the engineering department, were able to 
work on it without special training. Three years of apprenticeship 
are required to learn to handle the old equipment, which is regarded 
as far more simple than the new. Hence the intensive training
necessary had to be equivalent to many years’ apprenticeship.
Furthermore, since the new apparatus was to be installed in successive units,
groups of men would have to be trained for each unit. Since the first 
group would have to be ready before the first installation was complete, 
it can be seen that their instruction would have to be from blue prints 
and written descriptions. Hence the first group, consisting of about 
20 men out of the 125 employed by the company in this city, would 
have to be a select group as far as mental ability is concerned. The 
method used to train these men was this. They were allowed to put 
in four hours a day at their regular work and spend the other half of 
the day in school. Approximately eight months were required to 
put the group through the essentials of the new equipment. 

The basis on which the men were selected was the service record,
i.e., length of service with the company, ability as shown in the
supervising foreman’s estimate of daily work, and faithfulness to the
company’s interests. From time immemorial, the same bases have been
used to decide fitness for promotion or special work. This does not 
mean that the men in charge were not acquainted with the possibilities 
of the newly developed intelligence tests. To the contrary, they gave 
this method of selection their earnest consideration before deciding in 
favor of the older methods. Fortunately, I was able to secure an 
interview with the man in charge of the educational program of the com- 
pany in this district. In all probability, his opinion and recommenda- 
tions will be accepted as the policy of the company in other places 
where the same problem arises. This means then, that I am giving 
the probable opinion of one of the largest employers in the country on 
the feasibility of selecting employees for special training on a basis of 
their comparative intelligence as revealed by the intelligence tests. 

“While it is true that the intelligence tests offer a quick and efficient 
method of determining superior mental capacity, we could not afford 
to use it. In actual dollars and cents it would represent a big loss to us 
in the reaction of our employees. At present, we have a fine spirit of 
cooperation on the part of our employees. The use of the intelligence 
tests, however, would wreck the morale of our employees. To say to 
one man, ‘you can go to our special school, because you rank higher 
than your fellow workman in a group test,’ and to another of perhaps 
many years service, ‘You cannot go because relatively you are of inferior 
intelligence,’ would have a very disastrous effect. 

“In actual practice the intelligence tests will not be feasible because 
it disregards the human element and considers the individual only as an 
increment of the group. We must regard our employees as human 
beings and overlook many human frailties in the knowledge that they
are making our work, their life work and are giving their best according 
to their individual ability. Since a selection is necessary, the only 
basis for selection that the men consider just, is that of the service 
record with the possible exception of cases where an employee is not 
doing standard work. 

“We soon realize in our school that some could grasp the new work
much more rapidly than others but it is our intention to drill all thor- 
oughly in the work. In order to achieve our ends, every man complet- 
ing the instruction must be a thorough master of the new details; 
hence we are not confronted with a problem of possible failures, but 
of adjusting the instruction so that all can thoroughly master it. We 
do not expect to receive any students below average intelligence, since 
the nature of our work so far, requires at least average intelligence.” 

One other question occurred to me in thinking over what the instructor
has said. He had admitted the variation in intelligence among
those selected. In the plan of instruction, the men were actually 
divided into four classes in order to secure more individual instruction. 
Why could not the men be grouped according to mental ability and each 
class move forward according to the ability of its members? The 
answer to this question is particularly interesting to educators. 

“On the contrary we have gained considerable, by adopting the 
opposite course. We pair a slow man with a fast one and give all to 
understand that they must master the work. The result is that the 
fast man actually coaches the slower man thereby impressing the in- 
struction so much more thoroughly in his own mind. It has seemed 
to me that a man who masters new instruction rapidly does not, as a 
rule, take pains to firmly impress this upon his mind. On the contrary, 
a slow thinking man is usually able to retain what he has learned to a 
greater degree. In our work here the fast man attempting to coach 
the slow man gains this firm foundation and thus is benefited to the 
same extent as the slow man receiving the coaching. Since we have 
never attempted to rush any man but give each ample time to absorb 
all instruction, our results have been 100 per cent successful in our 
object of thoroughly training all men who come to us. We have 
received the favorable attention of the higher officials of the company 
for the thorough manner in which our work has been done and we have 
been asked to furnish full reports which will, in all probability, be 
used as the basic policy of any educational program inaugurated by
this company in other territory.” 

No doubt, exponents of the intelligence tests will find much to 
criticize in the attitude of the man whose statements I have just 
summarized. To them, I will say that they must take his viewpoint, 
that of an employer dealing with employees, and not an investigator 
in the field of mental testing. He is not interested in any other results 
than to furnish specially trained men to handle the new equipment 
when installation is completed. Any gain to the scientific world as a 
result of the use of the tests or even momentary gains through the 
efficiency effected by their use can be more than offset should the 
company lose the confidence of its employees. What I have given you 
then is the attitude of an employer on intelligence tests. 

COMMUNICATIONS





CONCERNING EZEKIEL CHEEVER 
To the Editor of the Journal of Educational Psychology. 


Sir: This is not an author’s exception to a reviewer’s criticism.
Seldom, if ever, is there complete agreement between the estimate a 
writer places on his product and the appraisal recorded by his critics. 
Consequently, whenever I observe a writer seeking to set a critic right
in the pages of that critic’s journal, I suspect the author of a sly attempt 
to get his own estimate of his work into print. I am guilty of no such 
chicanery. 

But a review in your October issue may have conveyed to some of
your readers the impression that I am presumptuous, if not downright
sacrilegious, in my occasional use of “Ezekiel Cheever” as a pen name.
Miss Zirbes, reviewing “Of What Use Are Common People?,”
referred to me as the “Prophet Ezekiel,” she commended my use of
“modernized mixed metaphors,” and she delighted in my “apocalyptic
style.” The original Mr. Cheever was not a prophet; he was fairly
modern; and to my knowledge he took no hand in producing apocalyp- 
tic literature. My fear is that Miss Zirbes has confused my Ezekiel 
with one of the Bible-makers. The William Jennings Bryan school of 
thinkers would doubtless freely forgive a critic of educational books 
for an educational journal who revealed wider acquaintance with the
Bible than with educational history. Still it appears to me that even 
a superficial knowledge of the Bible should have spared your critic 
from confusing Ezekiel Cheever, a New England schoolmaster, with 
a certain prophetic gent who held forth some six hundred years B. C. 
The prophet’s surname was Goldstein. 

H. E. BUCHHOLZ.
Baltimore, Jan. 25, 1924. 

NOTES ON ARTICLES IN EDUCATIONAL PSYCHOLOGY IN CURRENT ISSUES OF
OTHER MAGAZINES


REPORTED BY CECILE COLLOTON 
Department of Educational Psychology, The Lincoln School of Teachers College 


INTELLIGENCE TESTS 


The Problem of the Intelligence Test. Warren W. Coxe. Educational Review, 
1924, January, 73-77. Discusses (1) various conceptions of general intelligence, 
(2) what intelligence tests measure and (3) influence of test results on educational 
practice. 

What is Measured by Intelligence Tests. Omen Bishop. Journal of Edu- 
cational Research, 1924, January, 29-38. Teaching children material similar 
to that found in group intelligence tests results in greatly improved scores. Dis- 
cusses various factors influencing test scores. 

A Study of Intelligence Scales for Grade II and III. J. Cayce Morrison, W. B. 
Cornell, and Ethel Cornell. Journal of Educational Research, 1924, January, 
46-56. An attempt to determine objectively which of six tests would prove most 
useful in Grades II and III. According to the data, the Otis, Haggerty, and 
Dearborn tests are equally satisfactory. 

The Speed Factor in Mental Measurements. Giles M. Ruch. Journal of
Educational Research, 1924, January, 39-45. Experiments with the Terman 
group test and the reading and arithmetic tests of the Stanford achievement tests 
show that the speed factor does not materially influence the ratings of the pupils. 

The Intelligence of Mexican Children. William H. Sheldon. School and 
Society, 1924, February 2, 139-142. A comparative study of 100 white and 100 
Mexican children of the same age and school environment shows the average 
Mexican child to be about 14 months below the normal mental development 
for white children. 


MISCELLANEOUS 


A Test and Teaching Device in Citizenship for Use with Junior High School 
Pupils. Clara F. Chassell and Ella B. Chassell. Educational Administration 
and Supervision, 1924, January, 7-29. Describes in detail a test consisting of 
7 stories and 12 accompanying problems, each problem being followed by 10 
possible consequences of the course of action under consideration. Affords 
opportunity for training pupils to regulate conduct in the light of carefully con- 
sidered consequences. 

A Test of Ability to Weigh Foreseen Consequences. Clara F. Chassell, Ella B. 
Chassell, and Laura M. Chassell. Teachers College Record, 1924, January, 
39-50. Defines motive as the ability to weigh foreseen consequences and describes 
a test designed to measure this aspect of character. An account of the
construction of the test is given and a report as to its reliability and its correlation with
other measures. 

A Counseling Plan for Bridging the Gap between the Junior and Senior High 
Schools. Margaret M. Alltucker. The School Review, 1924, January, 60-66. 
Describes a plan used for two years in Berkeley, California for meeting the indi- 
vidual needs of pupils. 

Handling the Superior Child. Louis A. Pechstein. Educational Administration
and Supervision, 1924, January, 1-6. Discusses the difficulties involved in
special classes for gifted children. 

Conservation and the Free Method. Anna B. Palm. Chicago Schools Journal, 
1924, January, 175-177. Providing for individual differences by encouraging 
freedom of thought and action under wise control. 

Proposals for Standardizing Measurement in Education. II-The Coefficient of
Application. Harry S. Will. Journal of Educational Research, 1924, January,
57-62. Complicated statistical formulas for measuring educational attainment 
in terms of application and effort. 

Next Steps in Educational Surveying. William H. Allen. Educational Review, 
1924, February, 78-80. Proposes a survey “credo” in the form of a dozen simple
principles—the following of which will make the school survey more vital, effi- 
cient, and constructive. Eight special fields of inquiry are suggested. 

Casual Practice versus Practice under Instruction. George E. Freeland. The 
Journal of Educational Method, 1924, January, 203-206. Careful planning, 
analysis, and guidance by the teacher are invaluable to the learner. 

Factors Influencing the School Success of the Blind. Louis A. Pechstein. School 
and Society, 1924, January, 12. Conclusions based on 100 girls and 140 
boys in the New York State School for the Blind, Batavia, New York. Eight
tables of statistics. 

The Importance of Intelligent Silent Reading. William S. Gray. The Ele- 
mentary School Journal, 1924, January, 348-356. Why specific and systematic 
instruction in silent reading must be stressed in the elementary school. 

Mental Discipline. W.C. Ruediger. School and Society, 1924, February 9, 
171-172. Answers from 177 college students indicate that mental training, or a 
technique and standard of work, may come from any subject in which the student 
is interested and which is taught effectively. 

Teacher Failures in High School. S. P. Nanninga. School and Society,
1924, January 19, 79-82. Questionnaires answered by city superintendents give 
discipline, lack of cooperation, and poor instruction as the most frequent causes of 
failure among high school teachers. 

The Vocational Changes of 1000 Eminent American Women. Harry D. Kitson 
and Lucille Kirtley. School and Society, 1924, January 26, 110-112. A study 
of the first 1000 women in “Who’s Who in America” for 1922-1923 shows that of
the 11 per cent who have changed their vocation 62 per cent changed more than 
once and 30 per cent more than twice. Indicates the need for vocational guidance 
in colleges. 

NEW PUBLICATIONS IN EDUCATIONAL PSYCHOLOGY AND RELATED FIELDS OF
EDUCATION

CONDUCTED BY LAURA ZIRBES¹


A PSYCHOLOGY WITH APPLICATIONS TO EDUCATION


Psychology for Students of Education, by Arthur I. Gates. New York: 
The Macmillan Co., 1923. Pp. XVI + 489. 


The title of this new educational psychology, Psychology for Students 
of Education, indicates its character. The subject-matter of educa- 
tional psychology, as represented either in textbooks or in college or 
normal school courses in the subject, varies all the way from “pure”
or “general” psychology, with incidental applications to teaching, at
one extreme, to discussion of method at the other. The present book 
is primarily a general psychology in the order and selection of the topics 
and in its emphasis upon the general analysis of mental functions; 
but the very large emphasis which is given to the treatment of instinct 
and learning permits frequent and direct educational applications. 
Furthermore there are two chapters on mental and educational tests. 

The book opens with chapters on the methods of psychology, on 
the sensory, associative and motor mechanisms of the nervous system 
and on the classification and definition of mental processes. This is 
more or less introductory. The remainder of the discussion deals with 
the instinctive basis of behavior, with the modification and control of 
the instincts, ranging from the mere development of the instincts 
themselves to the various forms of learning. It concludes with an 
account of individual differences and the description of methods of 
measurement already mentioned. 

The order of treatment of the main part of the book is, in general, 
as will be seen, that followed by Thorndike in his Educational Psychol- 
ogy. The author also agrees with Thorndike’s analysis of learning as 
consisting of the formation of bonds, and with his statement of the 
laws of learning and identification of the various forms of learning as 
belonging fundamentally to the same type, in which trial and error is
the basic method. But he distinguishes between acquiring motor
skill, acquiring percepts and ideas and reasoning as at least superficially
different in character and as requiring different methods of
training. In this concession to the point of view which distinguishes
more sharply between the different forms of learning the treatment
gains, in the opinion of the reviewer, in practical value.

¹ Unsigned reviews were prepared by L. Z.

The general content of the book has perhaps been sufficiently 
indicated by the foregoing description. Its form, and the adequacy 
of treatment of the various topics, remains to be commented upon. 
In these respects Dr. Gates has done an exceptionally good piece of 
work. His style is clear, simple and free from mannerisms or crudities. 
The treatment of the various topics is very well proportioned, and is 
compact and yet contains enough illustrations to give the discussion 
concreteness. The judgments on theoretical or practical issues are, 
in general, very well balanced and give evidence of careful and well- 
matured thought. The writer predicts a wide use for this textbook 
among those who wish to combine a general outline of psychological 
principles with frequent reference to the application of these principles 
to educational practice. 


FRANK N. FREEMAN. 





THE LATEST WORD ON THE JUNIOR HIGH SCHOOL


Junior High School Education, by Calvin Olin Davis. Yonkers-on- 
Hudson: World Book Company, 1924. Pp. 451. 


Professors of secondary education continue to write good books on 
their best discovery, the junior high school. Each new volume 
organizes the previous material a little more compactly, reduces more 
of it to a-b-c-d’s and 1-2-3-4’s; and then brings the story down to 
date. Professor Davis’s treatment is typical. His discussions of the 
theory underlying the new school, its historical development, and its 
place and purpose in the educational scheme are orthodox, succinct, 
and arranged to facilitate preparation for impending true-false or com- 
pletion tests. 

His big contribution, however, is the orderly presentation of a mass 
of very significant material on “the content, arrangement, and methods
involved in each of the several subjects of the curriculum.” He has 
brought together here the views of curriculum authorities, the recom- 
mendations of committees—particularly the reports of the Commis- 
sion on the Reorganization of Secondary Education—actual programs 
and course outlines from progressive states, cities, and experimental
schools, and a wealth of practical suggestion of all kinds for curricu- 
lum makers in this field. The compilation appears to be decidedly 
worth-while. 

Concluding chapters on administration, collateral activities, build- 
ings, and standards are brief and rather tentative. They are followed 
by very satisfactory bibliographies and textbook lists in several 
appendixes. All in all, this is a well conceived text for college students 
in training, a practical reference volume for school authorities and 
teachers trying to put the junior high school over, and a worthy 
addition to the monographs on the same subject which have appeared 
in the last six or seven years. 

M. H. WILLING. 





BOSTON IN 1845 versus DETROIT IN 1923


Then and Now in Education—1845:1923, by Otis W. Caldwell and
Stuart A. Courtis. Yonkers-on-Hudson: World Book Company, 
1924. Pp. 400. 


On a basis of fact quite within the comprehension of any intelligent 
reader, and in a style of expression free from pedagogical obscurities, 
the authors of this book have given an account of some of the more 
striking changes that have taken place in public education during the 
last 75 years. The lay reader will enjoy this book not only because it 
is intrinsically interesting, but also because it spares him a detailed, 
technical record of the scientific, statistical, and “researchical”
agonies through which the authors must have gone to produce it. The 
professional reader, on the other hand, may safely express his approval, 
since all the sources, tables, and underlying data have been reprinted 
in appendixes filling half the volume. 

The inspiration for the book came from the discovery, or re- 
covery, of the exceptional reports of the Annual Visiting Committees 
of the Public Schools of the City of Boston in 1845. In that year— 
probably in response to a growing feeling that Horace Mann might, 
after all, know what he was talking about—the Sub-committees ap- 
pointed to examine the Grammar and Writing Schools abandoned the 
usual procedure of a perfunctory visit, a superficial oral examination, 
and a stereotyped report. Instead, they devised written examinations 
for pupils in the first class, gave these examinations themselves in a 
uniform way, carefully scored the papers, and made elaborate, statis- 
tically-confirmed reports of their findings to the School Committee. 
These reports, together with the extended comment of Horace Mann in
The Common School Journal following their publication, constitute a 
remarkable survey of the Boston School System of that day—a survey 
not very dissimilar in spirit and method from those of our own day. 

From the statements, data, and implications derived from these 
sources, Dr. Caldwell and Dr. Courtis have descriptively reconstructed 
for the modern reader the Boston School System of 1845 in terms 
of organization, administration, supervision, curriculum, textbooks, 
teaching methods, and buildings and equipment. Following this
they have reported the results from a retesting in 1919, with selected 
questions from the Boston tests, of no less than 12,000 eighth grade 
pupils in schools all over the United States. Recognizing that even 
this selected test material is too far removed from modern school 
experience to be a fair measure of learning or teaching to-day, one is all 
the more gratified to learn that the “results taken as a whole . . .
indicate plainly the improvement of instruction.” 

What the organization, management, and support of public educa- 
tion really mean in a large, progressive American city to-day, as 
contrasted with what they meant in the Boston of 1845, is explicitly 
and enthusiastically set forth in the last chapter of the book. There, 
too, the modern experimental school is given credit for its special 
contributions to educational science, and its advanced practices are 
cited as the basis for a series of prophecies concerning the schools of 
tomorrow. The effect of this portion of the book is to clarify one’s 
understanding of educational progress, to increase his pride in it, and 
to stiffen his resolution to support it as he may be able. 

In the appendixes appear complete reprints of the Boston Sub- 
committees’ reports, the pertinent excerpts from The Common School 
Journal, the 1919 tests, and sample pages from 1845 textbooks. The 
student of education will appreciate this accessibility of valuable 
sources, and the general reader will find a great deal here that is 
surprisingly free from dryness. 

M. H. WILLING.





MAKING PSYCHOLOGY CONCRETE


Problems in Psychology, by A. J. Snow. New York: Henry Holt and 
Co., 1923. Pp. VI + 115. 


The student of psychology too often leaves his courses with a well- 
filled notebook, a new vocabulary of psychological terms, and a 
variable knowledge of psychological theory, but with little experience 
in practical application of his knowledge. This little book of 115
pages presents varying points of view and illustrations from everyday
behavior and experience which compel the student to organize his 
knowledge and apply it to concrete problems. 

The book contains 778 questions or problems divided into 26
sections (the number of problems in a given section depending on
the scope of the topic) and a final review section of 210 problems. The
problems can be used with any textbook and any section can be used 
independently. The book should prove very useful to the teacher who 
is seeking a means of making his subject-matter more vital, or to the 
student who wishes practice in working the raw material of his informa- 
tion into a product that he can handle with confidence and skill. 

CECILE M. COLLOTON.





AN ATTEMPTED SURVEY OF THE FIELD OF MENTAL TESTING


Elementary Textbook of Mental Measurements, by Ernest Burton Skaggs. 
Ann Arbor, Mich.: George Wahr, 1923. Pp. 169. 


A book which leaves the reader with a feeling of confusion and 
uncertainty because of its lack of organization and its faulty English 
construction probably will not be wholly successful as a text. While 
the author aims “to give the student a broad and general survey of the 
field of mental measurement present, past, and future,” in the opinion
of the reviewer, he fails of accomplishment. If it were not reported
as “the outgrowth of over seven years of work,” one would be justified
in considering the book the work of a beginner in the field. In view
of the author’s comprehensive aim and of the seven years of work the
reader will be surprised to hear that, exclusive of the Appendix, there
are 144 pages in the 15 chapters. “General Intelligence” and
“General Intelligence: Heredity and Training” are discussed in
separate chapters. The applications of mental measurements fill 
the six pages of Chap. IV, while in Chap. VIII, seven pages are used 
for the discussion of the principles of constructing and applying mental 
measurements. The constancy of the intelligence quotient is men- 
tioned in a paragraph of eight lines on page 44; the IQ is explained in 
the next chapter and the evidence in favor of the constancy of the IQ 
is cited in the last chapter. 

Mental tests are classified in Chap. II but not until Chap. IX are the 
tests themselves discussed. Chapter XIV is simply an amplification 
of four of the topics classified in Chap. II. The paragraphs on measurement
of character traits give one the impression that any trait of
a person’s character can be measured by objective tests but the reader
is left without specific information as to the names of such tests, and 
there is no mention of a source of such information or of such tests. 

The author states that there is at the present time no single book 
which gives the student a comprehensive view of the field of mental 
measurements, but Pintner’s Intelligence Testing does that very thing 
rather effectively. The author deplores the fact that we have no 
standardized tests for infants. In his “Handbook of Mental Tests,”
Kuhlmann gives tests for children of 3 months, 6 months, 12 months,
18 months and 2 years. 

The chapter on minimal statistical essentials would have been 
omitted “if the student knew anything about it, but such is not the 
case.” One wonders whether the student will have a clear mental 
picture of a normal curve of distribution defined as “a curve of error,
a symmetrical bell shaped curve with definite mathematical properties.”
Or will he instantly recognize a coefficient of correlation after
reading “a letter as R, r.P, etc. as used to signify the coefficient of
correlation”? The student reads in this text that the number of cases
involved has nothing to do with the choice of a method of computing
correlation. The author recommends that the student use Rugg’s 
“Statistical Methods Applied to Education,” yet Rugg says, referring 
to this matter of computing the coefficient of correlation, “Use the
Rank method only when N is small (say less than 30).” 
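
For readers who want the point about sample size made concrete, the following is a minimal sketch of a rank-based coefficient, assuming that the "Rank method" refers to the usual rank-difference formula, rho = 1 - 6*sum(d^2) / (N(N^2 - 1)). The function names and the five pairs of scores are hypothetical and purely illustrative; tied scores are not handled.

```python
def ranks(values):
    """Return 1-based ranks (1 = largest score); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_correlation(xs, ys):
    """Rank-difference coefficient: 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two cases")
    # d is the difference between the two ranks assigned to the same case.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores of five pupils on two tests (illustrative only).
test_a = [12, 15, 9, 20, 17]
test_b = [30, 34, 28, 41, 33]
print(rank_correlation(test_a, test_b))
```

With so few cases the ranking can be done by hand, which is presumably why the quoted advice restricts the method to small N.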

A few illustrations will justify the criticism as to faulty English 
construction. 

“In order to understand better these tests it is desirable to briefly 
follow the evolution of the tests designed by Binet.” 

“Thus in learning a maze or puzzle box, the first trials are chance 
and if one person happens to get through the maze without many errors, 
e.g., he can not be compared to the person who was not so fortunate on 
the first trial.”

“To either understand.” “To continually look up.”

“The sounder tests can be modified such that it resists the ability 
of the subject to resist distractions.” 

“These superior intelligence children.” 

“Many universities and colleges have a personnel bureau usually 
consisting of at least one member who is a psychologist.” 

“Out of the various reactions of an individual we take some as 
standing out sufficiently by themselves as to strike attention.” 

“The data is in quantitative terms.” “Under this data . . .”

“Those interested in character analysis should confer the two books
by H. L. Hollingsworth.” 

Three chapters are followed by short lists of references, but other- 
wise all references are given in footnotes “for they are more apt to be 
seen there by the student than elsewhere.” Authors’ names are not
listed in the index and particular references can be found only by a 
general search. “A single unifying text” may be greatly needed in
the field of mental measurements, but this book does not meet the need. 

CECILE M. COLLOTON.





NORMAL IN MENTAL LEVEL, BUT—


The Unstable Child, by Florence Mateer. New York: D. Appleton 
and Company, 1924. Pp. XII + 471. 


Psychologists, teachers, parents and social workers will be grateful 
to this author for a category under which large numbers of perplexing 
problem cases may be grouped. After discussing the history and 
significance of mental tests and of classifications resulting from their 
use, this clinician maintains that no matter what the child’s mental 
level, the intelligence which he has may function efficiently, or ineffi- 
ciently, peculiarly, disastrously or unpredictably. Facts gathered with 
scientific impartiality are mustered to show that such deviations are 
not only more significant as explanations of socially maladjusted indi- 
viduals, but far more hopeful. The diagnosis of psychopathy or 
instability challenged this worker to an experimental evaluation of 
various types of corrective, therapeutic and preventive practice. 

The book is divided into two succinct parts. The first 11 chapters 
are grouped under the sub-title “The Unstable Child in Theory,”
and the remaining 10 chapters are concerned with the description 
and interpretation of practice. Case studies are grouped to illustrate 
the basis for generalizations and conclusions. This array of data 
should enable workers who have dealt with problems of the unstable 
or maladjusted child to formulate their own judgments as to the 
soundness of the hypothesis. Some of the conclusions are astounding. 
Others are so fully in accord with common sense as to require no 
further proof. 

The straightforward objective treatment of certain topics is refresh- 
ing when contrasted with the morbid manner of the confirmed psycho- 
analyst. The practical implications and ramifications of the problem 
should give this volume an unprejudiced but critical hearing, and a 
varied host of readers. 